EPUB is an open, industry-standard format for e-books. However, support for EPUB and its many features varies across reading devices and applications. Use your device or app settings to customize the presentation to your liking. Settings that you can customize often include font, font size, single or double column, landscape or portrait mode, and figures that you can click or tap to enlarge. For additional information about the settings and features on your reading device or app, visit the device manufacturer’s Web site.
Many titles include programming code or configuration examples. To optimize the presentation of these elements, view the e-book in single-column, landscape mode and adjust the font size to the smallest setting. In addition to presenting code and configurations in the reflowable text format, we have included images of the code that mimic the presentation found in the print book; therefore, where the reflowable format may compromise the presentation of the code listing, you will see a “Click here to view code image” link. Click the link to view the print-fidelity code image. To return to the previous page viewed, click the Back button on your device or app.
An innovative approach to building resilient, modern networks
Boston • Columbus • Indianapolis • New York • San Francisco Amsterdam • Cape Town • Dubai • London • Madrid • Milan Munich • Paris • Montreal • Toronto • Delhi • Mexico City • São Paulo Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.
For government sales inquiries, please contact governmentsales@pearsoned.com.
For questions about sales outside the U.S., please contact intlcs@pearson.com.
Visit us on the Web: informit.com/aw
Library of Congress Control Number: 2017958319
Copyright © 2018 Pearson Education, Inc.
All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.
ISBN-13: 978-1-58714-504-9
ISBN-10: 1-58714-504-9
1 17
Editor-in-Chief
Mark Taub
Product Line Manager
Brett Bartow
Development Editor
Christopher Cleveland
Managing Editor
Sandra Schroeder
Senior Project Editor
Tonya Simpson
Copy Editor
Chuck Hutchinson
Indexer
Ken Johnson
Proofreader
Abigail Manheim
Technical Reviewers
Peter Welcher, Jordan Martin
Publishing Coordinator
Vanessa Evans
Cover Designer
Chuti Prasertsith
Compositor
codeMantra
To Lori, my beautiful wife of 20 years.
To Bruce Little and Doug Bookman; for challenging me to think.
To Brett Bartow, Eyvonne Sharp, Phil Gervasi, and Jordan Martin; for inspiring me.
May God bless each of you for the blessings you have brought into my life.
—Russ White
To Summerset; for enabling me to pursue the things I must chase.
To Drew Conry-Murray; for commiseration, advice, and encouragement.
To Robin Young and Greg Ferro; for freedom to write and moral support.
To Jordan Martin; for not saying no.
To the Packet Pushers community; for their multiplied voices, both frustrated and victorious.
—Ethan Banks
Chapter 1: Fundamental Concepts
Flow Control in Packet Switched Networks
Fixed Versus Variable Length Frames
The Revenge of Centralized Control Planes
Managing Complexity through the Wasp Waist
Chapter 2: Data Transport Problems and Solutions
Digital Grammars and Marshaling
Digital Grammars and Dictionaries
Addressing Devices and Applications
Chapter 3: Modeling Network Transport
United States Department of Defense (DoD) Model
Open Systems Interconnect (OSI) Model
Recursive Internet Architecture (RINA) Model
Connection Oriented and Connectionless
Chapter 4: Lower Layer Transports
Data Marshaling, Error Control, and Flow Control
Final Thoughts on Lower Layer Transmission Protocols
Chapter 5: Higher Layer Data Transports
Chapter 6: Interlayer Discovery
Interlayer Discovery Solutions
Well-Known and/or Manually Configured Identifiers
Advertising Identifier Mappings in a Protocol
Calculating One Identifier from the Other
IPv4 Address Resolution Protocol
Final Thoughts on Packet Switching
Why Not Just Size Links Large Enough?
Timeliness: Low-Latency Queueing
Fairness: Class-Based Weighted Fair Queueing
Other QoS Congestion Management Tools
Managing a Full Buffer: Weighted Random Early Detection
Managing Buffer Delay, Bufferbloat, and CoDel
Final Thoughts on Quality of Service
Chapter 9: Network Virtualization
Understanding Virtual Networks
Providing Ethernet Services over an IP Network
Virtual Private Access to a Corporate Network
A Summary of Virtualization Problems and Solutions
Segment Routing with Multiprotocol Label Switching
Signaling Segment Routing Labels
Software-Defined Wide Area Networks
Interaction Surfaces and Shared Risk Link Groups
Interaction Surfaces and Overlaid Control Planes
Final Thoughts on Network Virtualization
Chapter 10: Transport Security
Protecting Data from Being Examined
Final Thoughts on Transport Security
Chapter 11: Topology Discovery
Nodes, Edges, and Reachable Destinations
Detecting Other Network Devices
Detecting Two-Way Connectivity
Detecting the Maximum Transmission Unit
Learning about Reachable Destinations
Advertising Reachability and Topology
Deciding When to Advertise Reachability and Topology
Reactive Distribution of Reachability
Proactive Distribution of Reachability
Redistribution between Control Planes
Redistribution and Routing Loops
Final Thoughts on Topology Discovery
Chapter 12: Unicast Loop-Free Paths (1)
Waterfall (or Continental Divide) Model
Bellman-Ford Loop-Free Path Calculation
Garcia’s Diffusing Update Algorithm
Chapter 13: Unicast Loop-Free Paths (2)
Dijkstra’s Shortest Path First
Suurballe’s Disjoint Path Algorithm
Chapter 14: Reacting to Topology Changes
Event-Driven Failure Detection
Comparing Event-Driven and Polling-Based Detection
An Example: Bidirectional Forwarding Detection
Consistency, Accessibility, and Partitionability
Chapter 15: Distance Vector Control Planes
Learning about Reachable Destinations
Concluding Thoughts on the Spanning Tree Protocol
The Routing Information Protocol
The Enhanced Interior Gateway Routing Protocol
Neighbor Discovery and Reliable Transport
Chapter 16: Link State and Path Vector Control Planes
A Short History of OSPF and IS-IS
The Intermediate System to Intermediate System Protocol
Neighbor and Topology Discovery
The Open Shortest Path First Protocol
Neighbor and Topology Discovery
Common Elements of OSPF and IS-IS
Conceptualizing Links, Nodes, and Reachability in Link State Protocols
Validating Two-Way Connectivity in SPF
The BGP Best Path Decision Process
Chapter 17: Policy in the Control Plane
Control Plane Policy Use Cases
Flow Pinning for Application Optimization
Control Plane Policy and Complexity
Final Thoughts on Control Plane Policy
Chapter 18: Centralized Control Planes
Considering the Definition of Software Defined
Considering the Division of Labor
Final Thoughts on Centralized Control Planes
Chapter 19: Failure Domains and Information Hiding
Defining Control Plane State Scope
Summarizing Topology Information
Aggregating Reachability Information
Filtering Reachability Information
Final Thoughts on Hiding Information
Chapter 20: Examples of Information Hiding
Summarizing Topology Information
Intermediate System to Intermediate System
The Border Gateway Protocol as a Reachability Overlay
Segment Routing with a Controller Overlay
Final Thoughts on Failure Domains
Chapter 21: Security: A Broader Sweep
The Biometric Identity Conundrum
Service Availability Assurance
The OODA Loop as a Security Model
Chapter 22: Network Design Patterns
Translating Business Requirements into Technical
What Is a Good Network Design?
Planar, Nonplanar, and Regular
Final Thoughts on Network Design Patterns
Chapter 23: Redundant and Resilient
The Problem Space: What Failures Look Like to Applications
Redundancy as a Tool to Create Resilience
In-Service Software Upgrade and Graceful Restart
Final Thoughts on Troubleshooting
Chapter 25: Disaggregation, Hyperconvergence, and the Changing Network
Changes in Compute Resources and Applications
Converged, Disaggregated, Hyperconverged, and Composable
Applications Virtualized and Disaggregated
The Special Properties of a Fabric
Traffic Engineering on a Spine and Leaf
Final Thoughts on Disaggregation
Chapter 26: The Case for Network Automation
Automation with Programmatic Interfaces
Network Automation with Infrastructure Automation Tools
Network Controllers and Automation
Network Automation for Deployment
Final Thoughts on the Future of Network Automation: Automation to Automatic
Chapter 27: Virtualized Network Functions
Decreased Time to Service through Automation
Compute Advantages and Architecture
Chapter 28: Cloud Computing Concepts and Challenges
Shifting from Capital to Operational Expenditure
Time-to-Market and Business Agility
Nontechnical Public Cloud Tradeoffs
Technical Challenges of Cloud Networking
Selecting Among Multiple Paths to the Public Cloud
Protecting Data over Public Transport
Chapter 29: Internet of Things
Securing Insecurable Devices Through Isolation
Final Thoughts on the Internet of Things
Looking Forward Toward Pervasive Automation
Machine Learning and Artificial Narrow Intelligence
Named Data Networking and Blockchains
Named Data Networking Operation
To begin, this book would not have been written if the need had not been recognized by Radia Perlman, hence planting the seed of the idea this book grew into. Beyond the seed, however, a book does not represent the work of just two authors; many people are actually involved in the process of creating and publishing the kind of high-quality content you now have access to. Below is a (hopefully complete) list of those who have participated in the creation of this content.
Ignas Bagdonas is an architect at Equinix, where he focuses on large-scale design of interconnection fabrics and network automation. Ignas has implemented BGP as part of his work at Routing System, Ltd.
Chris Kane is currently a systems engineer for Arista Networks, where he works on designing and deploying large-scale networks and is a founding member of the Ohio Networking User Group. Chris has been in the networking industry for over 25 years now, having worked in various verticals including Service Provider, Financial, Retail, and Consulting.
Kim Pedersen, CCIE 29189, CCDE 2017:0021, is a network engineer at Lytzen IT A/S, where he focuses on network design and the maintenance and development of international MPLS networks. He has a passion for learning new technical topics and is an avid reader of all things networking. He lives in Denmark with his wife and enjoys traveling!
Nick Russo, CCIE 42518, CCDE 2016:0041, is a network engineer at Cisco Systems in the Aberdeen, Maryland area, where he focuses on service provider, large-scale MPLS, and mobility design, as well as network automation. Nick is the author of the CCIE Service Provider Version 4 Written and Lab Exam Comprehensive Guide, available on LeanPub.
Maria Urlea, CCDP, CCDA, CCNP, CCNA, is a systems engineer at Cisco Systems in Ontario, Canada. Maria has received several master’s scholarships and student research awards, and focuses on network design and architecture for several large network operators.
Chris Cleveland is one of the finest development editors in the network engineering space; he has worked with Russ on 13 projects in conjunction with Pearson since 1997.
Russ White, CCIE No. 2635, CCDE 2007::1, CCAr, has more than 30 years of experience in designing, deploying, breaking, and troubleshooting large-scale networks. In that time, he has co-authored more than 40 software patents, spoken at venues throughout the world, participated in the development of several Internet standards, helped develop the CCDE and the CCAr, and worked in Internet governance with the Internet Society. Russ is currently a member of the architecture team at LinkedIn, where he works on next-generation data center designs, complexity, security, and privacy. He is also currently on the routing area directorate at the IETF and co-chairs the IETF I2RS and BABEL working groups. His most recent books are The Art of Network Architecture and Navigating Network Complexity.
Russ holds an MSIT from Capella University, a MACM from Shepherds Theological Seminary, and a PhD in progress from Southeastern Theological Seminary.
Ethan Banks, CCIE No. 20655, Routing & Switching, has been in IT since 1995, working early in his career as a systems engineer for Novell, Windows, and Linux environments. He later became an Internet services engineer working with DNS, SMTP, HTTP, and related applications at a regional ISP. He predominantly has been a network engineer and architect for enterprises in verticals including higher education, state government, consulting, finance, and technology. He has held titles such as senior network engineer, network operations manager, technical services manager, network architecture manager, and senior network architect.
In 2010, Ethan co-founded Packet Pushers Interactive, a media company whose premier product is a weekly podcast listened to by more than 10,000 network engineers all over the world.
Ethan is a writer whose content can be found in Network World, Network Computing, InformationWeek, Modern Infrastructure, and TechTarget, among other outlets. Ethan also maintains his own blog where he writes about technology at ethancbanks.com. Ethan has written and/or edited whitepapers for SolarWinds, Nuage Networks, CloudGenix, and NetBrain Technologies. He is currently the Future of Networking co-chair for Interop.
Ethan holds a Bachelor of Science degree in Computer Science & Business Administration from Pensacola Christian College in Pensacola, Florida, where he graduated summa cum laude in 1993. In the past, Ethan was certified as a Certified Netware Engineer, Microsoft Certified Systems Engineer, Cisco Certified Network Professional, Certified Ethical Hacker, and Cisco Certified Security Professional, among other titles.
There are many ways to approach teaching (or understanding) the fundamentals of computer network operation. For instance, one rather traditional way is to begin by examining the operation of a control plane in total, from building neighbor adjacencies to carrying information to building routes. Another common method is to start with a model, such as the Open Systems Interconnect (OSI) model, and describe the operation of the protocols from within the model. These methods have obviously been useful in teaching engineers and engineering students about how computer networks work, as they have been used to teach thousands, perhaps hundreds of thousands, of network engineers over the last 30 years.
But—in the view of the authors writing here—they have not been as effective as they could be. There are still many engineers who do not understand the basics of how a computer network actually works, in spite of many hours spent in labs, reading technical material, and even configuring and deploying network equipment. There is still a large gap in the fundamental mental skills of a large number of network engineers and engineering students that needs to be filled.
This book aims to fill that gap—not only for existing engineers, but also for all students who are trying to learn how computer networks work, even if network engineering is not their ultimate career goal. If you are a computer science student, a network engineer with 20 years of experience, someone just trying to learn network engineering, or even a business manager in charge of “the network,” this book has something to offer you.
This book was born of more than 50 years of combined experience in the field of network engineering, split between two authors who have, over those years, taken in everything from forwarding devices to control planes to storage to compute. The authors (and reviewers!) have spent thousands of hours teaching the many different crafts involved in network engineering in formal and informal training, across a wide range of formats and venues. The organization of this book is the result of numerous hours spent considering how best to approach the many aspects of computer network technologies. What works and (more importantly) what does not work were considered in detail, until a plan finally emerged that the authors believe will be helpful to the largest possible set of people in and around the computer networking field.
The organization of this book begins with a seed laid out in the Internet Engineering Task Force (IETF) Request for Comments (RFC) 1925, The Twelve Networking Truths. Rule 11 states:
Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.
While this is clearly humorous, humor would not be funny without at least a grain of truth. In the case of rule 11, there is more than a grain of truth: buried in rule 11, there is an entire way of looking at technology, and the pace of technological change, that can revolutionize the way engineers learn technology. If it is true that every idea will be proposed again, then it is also true that every idea has been proposed before. If it were possible to learn the basic concepts behind an idea the first time it is proposed, it should be possible to understand every new proposal grounded in the same ideas in the future.
This observation—the grounding ideas behind the technologies that make computer networks work do not really change—is what drives the teaching method used in this book. Instead of focusing on models or protocols, this book follows a distinct pattern.
The thesis of this book is, then: To truly understand computer networks, you need to ask and answer three questions: What is the problem? What are the possible solutions? What do these solutions look like when they are implemented?
This book is divided into three major parts covering data transport, the control plane, and specific design (or rather technology) situations. Within each of these parts, there are sets of chapters that begin by asking a basic question: what is the problem? Describing the problem set in a meaningful way often involves a good bit of theory, so these chapters may not, at first, seem very practical.
These chapters, however, are extremely practical; without a solid understanding of the problem, it is almost impossible to really understand any proposed or implemented solution in the correct context. Understanding the fundamental problems allows you to do two things:
• Relate problems you are facing right now, problems that might appear to be new, or unique, to a common body of problems solved in network engineering in the past.
• See and understand the component problems within a larger system clearly, and hence have a solid chance at applying a full range of solutions to each problem in a way that builds a complete and coherent system.
Asking this question is, in reality, the most important step you can take in truly understanding the technologies used to solve network engineering problems.
Once the problem is laid bare, this book will then consider a range of possible solutions. The set of solutions will not (necessarily) be restricted to the most common solutions, or to implemented solutions. Rather, the solutions chosen for inclusion will (hopefully) provide you with a good overview of the types of solutions available. Again, this part will tend to be theoretical, specifically in describing point solutions designed to solve point problems. Here, too, the appearance of impracticality is misleading: each solution is a "tool" you can add to the set of mental tools you can use to solve a wide array of problems. Combining problems and solutions in this way builds a solid set of mental skills useful for engineers of any type.
Finally, once a set of problems and a range of solutions for each problem have been considered, the problems and solutions will be drawn together into a set of implementation examples. This part is where you will see the connection between theory and practice: how each protocol sets out to solve a common set of problems and then selects among a range of solutions to solve those problems. The authors have striven to choose a wide range of protocols and systems for these parts, so you are not only carried through the solution space, but also (as much as possible within the confines of a work of this type) the history of computer network engineering.
Any book written in this field could be endless in scope—but such an endless book would not be constrained enough to be useful. To manage the scope and scale of this book, then, several choices have been made about what to cover and what not to cover.
Packet switched networks are covered; circuit switched networks are not. In packet switched networks, information is carried in packets, each of which contains enough information to route the packet through the network, from end to end. There is no "fixed" line of communications between the sender and receiver; just an underlying set of packet forwarding devices that, acting as a complete system, deliver these packets on a best-effort basis. Circuit switched networks, by contrast, can break up information in a way that does not require each packet to carry everything needed to forward it; instead, there are agreed-on paths and resources tied to each particular information flow.
The data and control planes are covered, but not the management plane. It is often difficult to determine where the data plane ends and the control plane begins. Likewise, it is often difficult to determine where the control plane ends and the management system begins. The authors have, based on their extensive experience, attempted to include just those topics related to building and managing paths available for forwarding packets through a network, while leaving out topics that appear to be more network management focused.
These omissions are not a statement about the importance of the topics in question; rather, a book such as this must be scoped in some way if it is to be writable by any set of humans in anything like a reasonable amount of time.
In many ways, understanding how the authors intend a book to be read is just as important a guide to using the material as understanding how the information is structured, or what question the book is trying to answer. This book is designed to reach a broad audience, from the "average" network engineer, to people trying to learn network engineering without any formal training, to college classrooms.
To reach across this scope, the authors have taken several specific steps:
• The material presented in the main text, while of varying depth (as required by the specific topic), will strive to maintain an introductory feel. The main flow of text will strive to use as few “big words” and “heavy symbols” as possible.
• More technical material, historical asides, and other material that the authors believe will be useful to those trying to learn network engineering will be placed into sidebars.
• Footnotes will only be provided to give credit to specific works that originated ideas or the works of specific individuals known for originating specific ideas. Explanations that would normally be placed in a footnote in other contexts will be placed in a sidebar.
• More deeply technical papers and resources will be listed at the end of each chapter for those who would like to investigate a specific topic more deeply. These items will have some information about which specific topic they are related to where possible.
A great deal of time and effort has gone into researching, writing, editing, and producing this book. The authors and editors who have worked on this represent some of the broadest, and often deepest, experience in every aspect of network engineering—protocol design and specification, protocol implementation, network design, network implementation, troubleshooting, and many others. Hopefully, this book will provide you with a deep and broad foundation from which to truly understand how computer networks work, and hence lay the groundwork you need to design, implement, and manage protocols and networks that will solve real-world problems for many years to come.
Register your copy of Computer Networking Problems and Solutions on the InformIT site for convenient access to updates and/or corrections as they become available. To start the registration process, go to informit.com/register and log in or create an account*. Enter the product ISBN (9781587145049) and click Submit. When the process is complete, you will find any available bonus content under Registered Products.
*Be sure to check the box that you would like to hear from us to receive exclusive discounts on future editions of this product.
To begin, the primary job of a network is to carry data from one attached host to another. This might appear to be simple at first glance, but it is actually fraught with problems. An illustration might be helpful here; Figure PI-1 is used to illustrate the complexity.
Beginning at the upper-left corner of the illustration:
1. The application generates some data. This data must be formatted in a way that allows the receiving application to understand what has been transmitted— the data must be marshalled. The mechanism used to marshal the data must be efficient in many ways, including fast and easy to encode, fast and easy to decode, flexible enough to allow for changes in encoding without breaking too many things, and adding the smallest amount of overhead possible during data transmission.
2. The network software needs to encapsulate the data, and get it ready to actually be transmitted. Somehow the network software needs to know the address of the destination host. The network that connects the source and destination is a shared resource, and hence some form of multiplexing must be available so the source can direct the information at the correct destination. Generally this will involve some form of addressing.
3. The data must be moved out of memory at the source and onto the network proper—the actual wire (or optical cable, or wireless link) that will carry the information between network-connected devices.
4. Network devices must have some way to discover the ultimate destination of the information—a second form of the multiplexing problem—and determine if there is any other processing that needs to be done on the information while it is in transit between the source and destination.
5. The information, after passing through the network device, must once again be encoded and moved out of memory onto the wire. At every point where information is moved from memory to some form of physical media, the information will need to be queued; there will often be more data to transmit than can be placed onto any particular physical media at any given time. This is where quality of service comes into play.
6. The information, as carried through the network, must now be copied off the physical media and back into memory. It must be checked for errors—this is error control—and there must be some way for the receiver to tell the transmitter it is running out of memory in which to store the incoming information— this is flow control.
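Steps 1, 2, and 6 above can be sketched in a few lines of code. The header layout, field sizes, and function names here are all hypothetical, chosen only to make the marshal, encapsulate, and decapsulate sequence concrete; real protocol headers carry far more metadata than a destination and a length.

```python
import struct

# Hypothetical 6-byte header: destination address (4 bytes) plus payload
# length (2 bytes), in network byte order. This only sketches the steps
# described above; it is not any real protocol's header format.
HEADER_FMT = "!IH"

def marshal(record: dict) -> bytes:
    # Step 1: encode application data into an agreed-on wire format.
    return ",".join(f"{k}={v}" for k, v in record.items()).encode()

def encapsulate(dest: int, payload: bytes) -> bytes:
    # Step 2: prepend a header so the network can multiplex by destination.
    return struct.pack(HEADER_FMT, dest, len(payload)) + payload

def decapsulate(packet: bytes) -> tuple[int, bytes]:
    # Step 6 (receiver side): strip the header and recover the payload.
    dest, length = struct.unpack_from(HEADER_FMT, packet)
    payload = packet[struct.calcsize(HEADER_FMT):]
    assert len(payload) == length  # a crude form of error checking
    return dest, payload

pkt = encapsulate(0x0A000005, marshal({"user": "a", "msg": "hello"}))
dest, data = decapsulate(pkt)
```

Note the tradeoff mentioned in step 1: a text encoding like the one above is easy to read and extend, but a binary encoding would add less overhead per packet.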
The network device in the middle of the diagram is of particular interest. A network device—such as a router, switch, or middle box—connects two physical media together to build an actual network. Perhaps the simplest question to begin with is this: why are these devices required in the first place? Routers and switches are obviously complex devices, with their own internal architecture (which will be covered in this chapter at a high level); why add this complexity to a network? There are two fundamental reasons.
The original reason for building these devices was to connect different kinds of physical media together. For instance, within a building it might be practical to run ARCnet or thicknet Ethernet (to use examples from the time when network devices were first invented). The distance these media can traverse, however, is very short— on the order of hundreds of meters. Somehow these networks must be extended between buildings, between campuses, between cities, and eventually between continents, using some sort of multiplexed (or inverse multiplexed) telephone circuit like a T1 or DS3. These two different media types use different kinds of signaling; there must be some sort of device that translates one kind of signaling into another.
The second reason is this: scale quickly became an issue. The nature of the physical world is such that you have two choices when it comes to putting data on a wire:
• The wire can connect precisely two computers; in this case, every pair of computers needs to be physically connected to every other computer it needs to communicate with.
• The wire can be shared among many computers (the wire can be a shared media).
To solve the problem the first way, you need a lot of wire. Solving the problem the second way seems like the obvious solution, but it presents another set of problems—specifically, how is the bandwidth available on the wire shared among all the devices? At some point, if there are enough devices on a single shared media, any sort of scheme used to enable resource sharing will, itself, consume as much or more bandwidth as any individual device connected to the wire. At some point, even a 100G link, shared among enough hosts, will leave each individual host with very little available resources.
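The back-of-envelope arithmetic behind that last point can be made concrete. The 10% coordination overhead used below is an assumed figure for illustration, not a measured one:

```python
# Back-of-envelope: per-host share of a shared link, with a hypothetical
# fixed fraction of capacity consumed by the sharing scheme itself
# (arbitration, collision recovery, token passing, and so on).
def per_host_share(link_gbps: float, hosts: int, overhead_fraction: float) -> float:
    usable = link_gbps * (1.0 - overhead_fraction)
    return usable / hosts

# A 100G link shared by 1,000 hosts with an assumed 10% coordination
# overhead leaves each host well under 100 Mb/s on average.
share = per_host_share(100.0, 1000, 0.10)
```

In practice the overhead is not a fixed fraction; it tends to grow with the number of contending hosts, which is exactly why shared media stop scaling.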
The solution to this situation is the network device—the router or switch—that separates two shared media, only passing traffic between the two as needed. With some logical planning, devices that need to talk to each other more often can be placed closer together (in terms of network topology), conserving bandwidth in other places. Routing and switching has moved far beyond these humble beginnings, of course, but these are the root problems engineers solved by injecting network devices into networks.
There are other difficult problems to solve in this space beyond the bare carrying of information from a source to a destination; many times it is useful to be able to virtualize the network, which generally means creating a tunnel between two devices in the network.
The series of chapters in Part I consider the sometimes incredibly difficult problems in simply carrying data from one end of a network to the other, along with a range of possible solutions for each of these problems. Along the way, various chapters also explore the concept of layering in data transport protocols, and its importance to breaking this complex domain into more solvable chunks. Layering, however, brings its own set of problems into the transport world, so Part I also needs to consider how to resolve the problems caused by the introduction of layering—specifically, interlayer discovery.
The chapters in this part include:
• Chapter 1: Fundamental Concepts, which discusses business drivers, circuit switching, packet switching, and network complexity
• Chapter 2: Data Transport Problems and Solutions, which discusses marshaling data, dictionaries, grammars, metadata, error detection, error correction, addressing, multiplexing, multicast, anycast, and flow control
• Chapter 3: Modeling Network Transport, which discusses the value of modeling, the Department of Defense (DoD) model, the Open Systems Interconnect (OSI) model, the Recursive Internet Architecture (RINA) model, connection-oriented and connectionless transport mechanisms
• Chapter 4: Lower Layer Transports, which discusses Ethernet and 802.11 Wireless
• Chapter 5: Higher Layer Data Transports, which discusses the Internet Protocol (IP), the Transmission Control Protocol (TCP), QUIC, and the Internet Control Message Protocol (ICMP)
• Chapter 6: Interlayer Discovery, which discusses mapping identifiers and services between layers, the Domain Name System (DNS), the Address Resolution Protocol (ARP), Neighbor Discovery (ND), Stateless Address Autoconfiguration (SLAAC), and the concept of the default gateway
• Chapter 7: Packet Switching, which discusses the process of copying a packet off of the physical media, processing the packet, moving a packet through the network device, and finally copying a packet onto the physical medium
• Chapter 8: Quality of Service, which discusses why Quality of Service (QoS) is needed, traffic classification, Class of Service, Type of Service, QoS trust boundaries, jitter, and fairness in queueing
• Chapter 9: Network Virtualization, which discusses use cases for network virtualization, tunneling, switching tunneled packets, the problems network virtualization must solve, Segment Routing (SR), Software-Defined Wide Area Networks (SD-WAN), virtualization tradeoffs, and shared fate
• Chapter 10: Transport Security, which discusses data exhaust, asymmetric and symmetric encryption, key exchange, hiding user information, man-in-the-middle (MitM) attacks, and Transport Layer Security (TLS)
Networks were always designed to do one thing: carry information from one attached system to another. The discussion (or perhaps argument) over the best way to do this seemingly simple task has been long-lived, sometimes accompanied by more heat than light, and often intertwined with people and opinions of a rather absolute kind. This history can be roughly broken into multiple, and often overlapping, stages, each of which asked a different question:
• Should networks be circuit switched or packet switched?
• Should packet switched networks use fixed- or variable-sized frames?
• What is the best way to calculate a set of shortest paths through a network?
• How should packet switched networks interact with Quality of Service (QoS)?
• Should the control plane be centralized or decentralized?
Some of these questions have been long since answered, most often by blending the more extreme elements of each position into a sometimes messy, but generally always useful, solution. Some of these questions are, on the other hand, still active, particularly the last one. Perhaps, in twenty years’ time, readers will be able to look on this last question as being answered, as well.
This chapter will describe the basic terms and concepts used in this book from within the framework of these historical movements or stages within the world of network engineering. By the end of this chapter, you will be ready to tackle the first two parts of this book—the forwarding plane and the control plane. The third part, an overview of network design, builds on the first two parts. The final part of this book looks at a few specific technologies and movements likely to shape the future—not only of network engineering, but even of the production and processing of information overall.
One question that must be asked, up front, is whether network engineering is an art, or truly engineering. Many engineering fields begin as more of an art. For instance, in the early 1970s, working on and around electronics—tubes, “coils,” and transformers—was largely considered an art. By the mid-1980s, electronics had become ubiquitous, and this began a commoditization process harsher than any standardization. Electronics then was considered more engineering than art. By the 2010s, electronics became “just the stuff that makes up computers.” There is still some art in the designing and troubleshooting of electronics, but, by and large, their creation became more focused on engineering principles. The problems have moved from “how do you do that,” to “what is the cheapest way to do that,” or “what is the smallest way to do that,” or some other problem that would have been considered second order in the earlier days. Perhaps one way to phrase the movement in electronics is in ratios. Perhaps (and these are very rough estimates), electronics started at around 80% art and 20% engineering, and has now moved to 80% engineering and 20% art.
What about network engineering? Will it pass through the same phases, eventually moving into the 80% engineering, 20% art range? This seems doubtful for several reasons. Network engineering works in a largely virtual space; although there are wires and devices, the protocols, data, and functionality are all laid on top of the physical infrastructure, rather than being the physical infrastructure. Unlike electronics, where you can point to a physical object and say, “this is the product,” a network is not a physical thing. To put this another way, the network is a conceptual “thing” built using a wide array of individual components connected together through protocols and data models. This means design choices are almost infinitely variable and malleable. Each problem can be approached, assessed, and designed much more specifically than in electronics. So long as there are new problems to solve, there will be new solutions developed, deployed, and—eventually (finally) removed from networks. Perhaps a useful comparison is between applications and the various kinds of computing devices; no matter how standardized computing devices become, there is still an almost infinite selection of software applications to run on top.
Figure 1-1 will be useful in illustrating this “fit” between the network and the business from one perspective.
In Figure 1-1, the solid gray curved line is business growth. The dashed black line running vertical and horizontal is network capacity. There are many times when the network is overprovisioned, costing the business money to maintain unused capacity; these are shown in the gray line-shaded regions. There are other times when the network is under strain. In these darker gray solid-shaded regions, the business could grow faster, but the network is holding it back. One of the many objectives of network architecture and design (this is more of an architecture issue than strictly a design issue; see The Art of Network Architecture) is to bring these lines closer together. Accomplishing this part of the work requires creativity and future-thinking problem-solving skills. The engineer must ask questions like “How can the network be built so it scales larger and smaller to move with the business’s requirements?” This is more than just scale and size, however; it is possible the nature of the business may even change over time, driving changes in applications, operational procedures, and operational pace. The network must have an architecture capable of changing as needed without introducing ossification, or the hardening of systems and processes that will eventually cause the network to fail in a catastrophic way. This part of working on networks is often considered more art than engineering, and it is not likely to change until the entire business world changes in some way.
Figure 1-2 illustrates another way in which businesses drive network engineering as an art.
In Figure 1-2, time runs from the left to the right, and feature count from the bottom to the top. What the chart expresses is the additional features added to a product over time. Network operator A will start out needing a somewhat small feature set, but the feature set required will increase over time; the same will hold true of the other three networks. The feature sets required to run any of these networks will always overlap to some degree, and they will also always be different to some degree. If a vendor wants to be able to sell a single product (or product line) and cater to all four networks, it will need to implement every unique feature required by each network. The entire set of features is depicted by the peak of the chart on the right side. For each of the networks, some percentage of the features available in any product will be unnecessary—also known as code bloat.
Even though these features are not being used, each one will still represent security vulnerabilities, code that must be tested, code that interacts with features that are being used, etc. In other words, each one of these unused features is actually a liability for the network operator. The ideal solution might be to custom build equipment for each network, containing just the features required—but this is often not a choice available to either the vendor or the network operator. Instead, network engineers must somehow balance between required features and available features—and this process is definitely more a form of art than engineering.
So long as there are mismatches between the way networks can be built and the way businesses use networks, there will always be some interplay between art and engineering in networking. The percentage of each one will vary based on the network, tools, and the time within the network engineering field, of course, but the art component will probably always be more strongly represented in the networking field than it is in fields like electronics engineering.
Note
Some people might object to the use of the word art in this section. It is easy enough to replace art with craft, however, if this makes the concepts in this section easier to understand.
The first large discussion in the computer networking world was whether networks should be circuit switched or packet switched. The basic difference between these two is the concept of a circuit—do the transmitter and receiver “see” the network as a single wire, or connection, preconfigured (or set up) with a specific set of properties before they begin communicating? Or do they “see” the network as a shared resource, where information is simply generated and transmitted “at will”? The former is considered circuit switched, while the latter is considered packet switched. Circuit switching tends to provide more traffic flow and delivery guarantees, while packet switching tends to deliver data at a much lower cost—the first of many tradeoffs you will encounter in network engineering. Figure 1-3 will be used to illustrate circuit switching, using Time Division Multiplexing (TDM) as an example.
In Figure 1-3, the total bandwidth of the links between any two devices is split up into eight equal parts; A is sending data to E using time slot A1 and F using time slot A2; B is sending data to E using time slot B1 and F using time slot B2. Each piece of information is a fixed length, so each one can be put into a single time slot in the ongoing data stream (hence, each block of data represents a fixed amount of time, or slot, on the wire). Assume there is a controller someplace assigning a slot on each of the segments the traffic will traverse:
• For [A,E] traffic:
• At C: slot 1 from A is switched to slot 1 toward D
• At D: slot 1 from C is switched to slot 1 toward E
• For [A,F] traffic:
• At C: slot 4 from A is switched to slot 4 toward D
• At D: slot 4 from C is switched to slot 3 toward F
• For [B,E] traffic:
• At C: slot 4 from B is switched to slot 7 toward D
• At D: slot 7 from C is switched to slot 4 toward E
• For [B,F] traffic:
• At C: slot 2 from B is switched to slot 2 toward D
• At D: slot 2 from C is switched to slot 1 toward F
None of the packet processing devices in the network need to know which bit of data is going where; so long as C takes whatever is in slot 1 in A’s data stream in each time frame and copies it to slot 1 in its outgoing stream toward D, and D copies it from slot 1 inbound from C to slot 1 outbound to E, traffic transmitted by A will be delivered at E. There is an interesting point to note about this kind of traffic processing—to forward the traffic, none of the devices in the network actually need to know what the source or destination is. The blocks of data being transmitted through the network do not need to contain source or destination addresses; where they are headed, and where they are coming from, are all determined by the controller’s knowledge of open slots in each link. The set of slots assigned to any particular device-to-device communications is called a circuit, because it is bandwidth and network resources committed to the communications between the one pair of devices.
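The slot mappings listed above amount to a per-device switching table keyed on the inbound neighbor and slot. A minimal sketch follows, with the table contents taken from the example and everything else (device names, function shape) illustrative only:

```python
# Per-device TDM switching tables from the example above: each entry maps
# (inbound neighbor, inbound slot) -> (outbound neighbor, outbound slot).
# Note that no source or destination address appears anywhere; a controller
# installed these mappings ahead of time.
SLOT_TABLE = {
    "C": {
        ("A", 1): ("D", 1),  # [A,E] traffic
        ("A", 4): ("D", 4),  # [A,F] traffic
        ("B", 4): ("D", 7),  # [B,E] traffic
        ("B", 2): ("D", 2),  # [B,F] traffic
    },
    "D": {
        ("C", 1): ("E", 1),
        ("C", 4): ("F", 3),
        ("C", 7): ("E", 4),
        ("C", 2): ("F", 1),
    },
}

def switch(device: str, from_dev: str, slot: int) -> tuple[str, int]:
    # Copy whatever occupies the inbound slot into the preassigned outbound
    # slot; the payload itself is never inspected.
    return SLOT_TABLE[device][(from_dev, slot)]

# Follow the [B,E] traffic: B's slot 4 arrives at C, leaves toward D in
# slot 7, then leaves D toward E in slot 4.
next_dev, next_slot = switch("C", "B", 4)
final_dev, final_slot = switch(next_dev, "C", next_slot)
```

The lookup is a constant-time table access with no header parsing at all, which is exactly the first advantage listed below.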
The primary advantages of circuit switched networks include:
• The devices do not need to read a header, or do any complex processing, to switch packets. This was extremely important in the early days of networking, when hardware had much lower transistor and gate counts, line speeds were lower, and the time to process a packet in the device was a large part of the overall packet delay through the network.
• The controller knows the available bandwidth and traffic being pushed toward the edge devices everywhere in the network. This makes it somewhat simple, given there is actually enough bandwidth available, to engineer traffic to create the most optimal paths through the network possible.
There are also disadvantages, including:
• The complexity of the controller ramps up significantly as the network and services it offers grow in scale. The load on the controller can become overwhelming, in fact, causing network outages.
• The bandwidth on each link is not used optimally. In Figure 1-3, the blocks of time (or cells) containing an * are essentially wasted bandwidth. The slots are assigned to a particular circuit ahead of time: slots used for the [A,E] traffic cannot be “borrowed” for the [A,F] traffic even when A has nothing to transmit toward E.
• The time required to react to changes in topology can be quite long in network terms; the local device must discover the change, report it to the controller, and the controller must reconfigure every network device along the path of each affected traffic flow.
TDM systems contributed a number of ideas to the development of the networks used today. In particular, TDM systems molded much of the early discussion on breaking data into packets for transmission through the network, and laid the groundwork for much later work in QoS and flow control. One rather significant idea these early TDM systems bequeathed to the larger networking world is network planes.
Note
Quality of Service is briefly considered in a later section in this chapter, and then in more depth in Chapter 8, “Quality of Service,” later in this book.
Specifically, TDM systems are divided into three planes:
• The control plane is the set of protocols and processes that build the information necessary for the network devices to forward traffic through the network. In circuit switched networks, the control plane is completely a separate plane; there is normally a separate network between the controller and the individual devices (though not always, particularly in newer circuit switched systems).
• The data plane (also known as the forwarding plane) is the path of information through the network. This includes decoding the signal received in a wire into frames, processing them, and pushing them back onto the wire, encoded according to the physical transport system.
• The management plane is focused on managing the network devices, including monitoring the available memory, monitoring queue depth, and monitoring when the device drops the information being transmitted through the network, etc. It is often difficult to distinguish between the management and the control planes in a network. For instance, if the device is manually configured to forward traffic in a particular way, is this a management plane function (because the device is being configured) or a control plane function (because this is information about how to forward information)?
Note
This question does not have a definitive answer. Throughout this book, however, anything that impacts the way traffic is forwarded through the network is considered part of the control plane, while anything that impacts the physical or logical state of the device, such as interface state, is considered part of the management plane. Do not expect these definitions to hold true in the real world.
Frame Relay, SONET, ISDN, and X.25 are examples of circuit switched technology, some of which are still deployed at the time of writing. See the “Further Reading” section for suggested sources for learning about these technologies.
In the early- to mid-1960s, packet switching was “in the air.” A lot of people were rethinking the way networks had been built until then, and were considering alternatives to the circuit switched paradigm. Paul Baran, working for the RAND Corporation, proposed a packet switching network as a solution for survivability; around the same time, Donald Davies, in the UK, proposed the same type of system. These ideas made their way to the Lawrence Livermore Laboratory, leading to the first packet switched network (called Octopus) being put into operation in 1968. The ARPANET, an experimental packet switched network, began operation not long after, in 1970.
Note
The actual process of switching a packet is discussed in greater detail in Chapter 7, “Packet Switching.”
The essential difference between circuit switching and packet switching is the role individual network devices play in the forwarding of traffic, as Figure 1-4 illustrates.
In Figure 1-4, A produces two blocks of data. Each of these includes a header describing, at a minimum, the destination (represented by the H in each block of data). This complete bundle of information—the original block of data and the header—is called a packet. The header can also describe what is inside the packet, and include any special handling instructions forwarding devices should take when processing the packet—these are sometimes called metadata, or “data about the data in the packet.”
There are two packets produced by A: A1, destined to E; and A2, destined to F. B sends two packets as well: B1, destined to F, and B2, destined to E. When C receives these packets, it reads a small part of the packet header, often called a field, to determine the destination. C then consults a local table to determine which outbound interface the packet should be transmitted on. D does likewise, forwarding the packet out the correct interface toward the destination.
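The hop-by-hop decision at C and D can be sketched in a few lines of Python. The interface names and table contents below are invented to match the narrative around Figure 1-4; they are not taken from the figure itself.

```python
# Hypothetical sketch of hop-by-hop forwarding: each device holds only a
# local table mapping a destination (read from the packet header) to an
# outbound interface; no device knows the full path.
forwarding_tables = {
    "C": {"E": "if-to-D", "F": "if-to-D"},   # C forwards everything via D
    "D": {"E": "if-to-E", "F": "if-to-F"},   # D picks the final interface
}

def forward(device, packet):
    """Return the outbound interface for a packet at a given device."""
    destination = packet["header"]["dst"]     # read one field of the header
    return forwarding_tables[device][destination]

packet_a1 = {"header": {"dst": "E"}, "payload": b"block of data"}
print(forward("C", packet_a1))  # if-to-D: C's independent decision
print(forward("D", packet_a1))  # if-to-E: D's independent decision
```

Note that C and D never coordinate; each consults only its own table, which is the essence of the hop-by-hop paradigm.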
This way of forwarding traffic is called hop-by-hop forwarding, because each device in the network makes a completely independent decision about where to forward each individual packet. The local table each device consults is called a forwarding table; this normally is not one table, but many tables, potentially including a Routing Information Base (RIB) and a Forwarding Information Base (FIB).
Note
These tables, how they are built, and how they are used, are explained more fully in Chapter 7, “Packet Switching.”
In the original circuit switched systems, the control plane is completely separate from packet forwarding through the network. With the move from circuit to packet switched, there was a corresponding move from centralized controller decisions to a distributed protocol running over the network itself. For the latter, each node is capable of making its own forwarding decisions locally. Each device in the network runs the distributed protocol to gain the information needed to build these local tables. This model is called a distributed control plane; thus the idea of a control plane was simply transferred from one model to the other, although they do not actually mean the same thing.
Packet switching networks can use a centralized control plane, and circuit switching networks can use distributed control planes. At the time packet switched networks were first designed and deployed, however, they typically used distributed control planes. Software-Defined Networks (SDNs) brought the concept of centralized control planes back into the world of packet switched networks.
The first advantage the packet switched network has over a circuit switched network is the hop-by-hop forwarding paradigm. As each device can make a completely independent forwarding decision, packets can be dynamically forwarded around changes in the network topology, eliminating the need to communicate to the controller and await a decision. So long as there are at least two paths between the source and the destination (the network is two-connected), packets handed to the network by the source will eventually be handed to the destination by the network.
The second advantage the packet switched network has over a circuit switched network is the way the packet switched network uses bandwidth. In the circuit switched network, if a particular circuit (really a time slot in the TDM example given) is not used, then the slot is simply wasted. In hop-by-hop forwarding, each device can best use the bandwidth available on each outbound link to carry the necessary traffic load. While this is locally more complex, it is globally simpler, and it makes better use of network resources.
The primary disadvantage of packet switched networks is the additional complexity required, particularly in the forwarding process. Each device must be able to read the packet header, look up the destination in a table, and then forward the information based on the table lookup results. In early hardware, these were difficult, time-consuming tasks; circuit switching was generally faster than packet switching. As hardware has improved over time, the speed of switching a variable length packet is generally close enough to the speed of switching a fixed length packet that there is little difference between packet and circuit switching.
In a circuit switched network, the controller allocates a specific amount of bandwidth to each circuit by assigning time slots from the source to the destination. What happens if the transmitter wants to send more traffic than the allocated time slots will support? The answer is simple—it cannot. In a sense, then, the ability to control the flow of packets through the network is built in to a circuit switched network; there is no way to send more traffic than the network can forward, because the "space" the transmitter has at its disposal for information sending is pre-allocated.
What about packet switched networks? If all the links in the network shown in Figure 1-4 have the same link speed, what happens if both A and B want to use the entire link capacity toward C? How will C decide how to send it all to D on a link that is half as large as the traffic it needs to handle? Here is where traffic flow control techniques can be used. Typically, they are implemented as a separate protocol/rule set “riding on top of” the underlying network, helping to “organize” packet transmission by building a virtual circuit between the two communicating devices.
Note
Flow and error control are discussed in detail in Chapter 2, “Data Transport Problems and Solutions.”
The Transmission Control Protocol (TCP) provides flow control for Internet Protocol (IP) based packet switched networks. This protocol was first specified in 1973 by Vint Cerf and Bob Kahn.
In the late 1980s, a new topic of discussion washed over the network engineering world—Asynchronous Transfer Mode (ATM). The need for ever higher speed circuits, combined with slow progress in switching packets individually based on their destination addresses, led to a push for a new form of transport that would, eventually, reconfigure the entire set (or stack, because each protocol forms a layer on top of the protocol below, like a "stacked cake") of protocols used in modern networks. ATM combined the fixed length cell (or packet) size of circuit switching with a header from packet switching (although greatly simplified) to produce an "in between" technology solution. There were two key points to ATM: label switching and fixed cell sizes; Figure 1-5 illustrates the first.
In Figure 1-5, G sends a packet destined to H. On receiving this packet, A examines a local table, and finds the next hop toward H is C. A’s local table also specifies a label, shown as L, rather than “just” information about where to forward the packet. A inserts this label into a dedicated field at the head of the packet and forwards it to C. When C receives the packet, it does not need to read the destination address in the header; rather, it just reads the label, which is a short, fixed length field. The label is looked up in a local table, which tells C to forward traffic to D for destination H. The label is very small, and is therefore easy to process for the forwarding devices, making switching much faster.
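The lookup just described can be sketched as follows, in the spirit of Figure 1-5. The label values (100, 200) and interface names are invented; only the ingress device reads the full destination address, and each subsequent device does a short fixed-length label lookup, typically swapping the label per hop.

```python
# Hypothetical label-switching sketch: A imposes the first label; C and D
# index a small table by the incoming label instead of parsing addresses.
ingress_map = {"H": ("to-C", 100)}      # A: destination -> (interface, label)
label_tables = {
    "C": {100: ("to-D", 200)},          # incoming label -> (interface, new label)
    "D": {200: ("to-H", None)},         # egress: pop the label, deliver to H
}

def switch(device, label):
    """One fixed-length label lookup: returns (outbound interface, new label)."""
    return label_tables[device][label]

iface, label = ingress_map["H"]         # only A reads the destination address
iface, label = switch("C", label)       # C swaps 100 -> 200, forwards to D
iface, label = switch("D", label)       # D pops the label at the network edge
print(iface)  # to-H
```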
The label can also “contain” handling information for the packet, in a sense. For instance, if there are actually two streams of traffic between G and H, each one can be assigned a different label (or set of labels) through the network. Packets carrying one label can be given priority over packets carrying another label, so the network devices do not need to look at any fields in the header to determine how to process a particular packet.
This can be seen as a compromise between packet and circuit switching. While each packet is still forwarded hop by hop, a virtual circuit can also be defined by the label path through the network. The second point was that ATM was also based on a fixed size cell: each packet was limited to 53 octets of information. Fixed size cells may seem to be a minor issue, but fixed size packets can make a huge performance difference. Figure 1-6 illustrates some factors involved in fixed cell sizes.
In Figure 1-6, packet 1 (A1) is copied from the network into memory on a line card or interface, LC1; then it travels across the internal fabric inside B (between memory locations) to LC2, being finally placed back onto the network at B’s outbound interface. It might seem trivial from such a diagram, but perhaps the most important factor in the speed at which a device can switch/process packets is the time it takes to copy the packet across any internal paths between memory locations. The process of copying information from one place in memory to another is one of the slowest operations a device can undertake, particularly on older processors. Making every packet the same (a fixed cell size) allowed code optimizations around the copy process, dramatically increasing switching speed.
Note
The process of switching a packet across an internal fabric is considered in Chapter 7, “Packet Switching.”
Packet 2’s path through B is even worse from a performance perspective; it is copied off the network into local memory first. When the destination port is determined by looking in the local forwarding table, the code processing the packet realizes the packet must be fragmented to fit into the largest packet size allowed on the outbound [B,C] link. The inbound line card, LC1, fragments the packet into A1 and A2, creating a second header and adjusting any values in the header as needed. The packet is divided into two packets, A1 and A2. These two packets are copied in two operations across the fabric to the outbound line card, LC2. By using fixed size cells, ATM avoids the performance cost of fragmenting packets (at the time ATM was being proposed) incurred by almost every other packet switching system.
ATM did not, in fact, start at the network core and work its way to the network edge. Why not? The first answer lies in the rather strange choice of cell size. Why 53 octets? The answer is simple—and perhaps a little astounding. ATM was supposed to replace not only packet switched networks, but also the then-current generation of voice networks based on circuit switched technologies. In unifying these two technologies, providers could offer both sorts of services on a single set of circuits and devices.
What amount of information, or packet size, is ideal for carrying voice traffic? Around 48 octets. What amount of information, or packet size, is the minimum that makes any sort of sense for data transmission? Around 64 octets. Fifty-three octets was chosen as a compromise between these two sizes; it would not be perfect for voice transmission, as 5 octets of every cell carrying voice would be wasted. It would not be perfect for data traffic, because the most common packet size, 64 octets, would need to be split into two cells to be carried across an ATM network. A common line of thinking, at the time these deliberations were being held, was that the data transport protocols would be able to adjust to the slightly smaller cell size, hence making 53 octets an optimal size to support a wide variety of traffic. The data transport protocols, however, did not adjust. To transport a 64-octet block of data, one cell would contain 53 octets, and the second would contain 11 octets, with 42 octets of empty space. Providers discovered 50% or more of the bandwidth available on ATM links was consumed by empty cells—effectively wasted bandwidth. Hence, data providers stopped deploying ATM, voice providers never really started deploying it, and ATM died.
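The waste described above is easy to reproduce with a little arithmetic. This sketch follows the text's simplified model, in which all 53 octets of a cell carry data; accounting for the real 5-octet cell header would make the efficiency worse still.

```python
import math

CELL_OCTETS = 53        # the ATM cell size discussed in the text
PACKET_OCTETS = 64      # the most common data packet size

cells = math.ceil(PACKET_OCTETS / CELL_OCTETS)   # cells needed per packet
on_wire = cells * CELL_OCTETS                     # octets actually transmitted
empty = on_wire - PACKET_OCTETS                   # padding carried as filler

print(cells)                             # 2 cells per 64-octet packet
print(empty)                             # 42 octets of empty space
print(round(empty / on_wire * 100, 1))   # 39.6 -- percent of the link wasted
```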
What is interesting is how the legacy of projects like ATM lives on in other protocols and ideas. The label switching concept was picked up by Yakov Rekhter and other engineers, and developed into tag switching. This keeps many of the fundamental advantages of ATM's quick lookup in the forwarding path, and bundles the metadata about packet handling into the label itself. Label switching eventually became Multiprotocol Label Switching (MPLS), which not only provides faster lookup, but also stacks of labels and virtualization. The basic idea was thus taken and expanded, impacting modern network protocols and designs in significant ways.
The second legacy of ATM is the fixed cell size. For many years, the dominant network transport suite, based on TCP and IP, has allowed network devices to fragment packets while forwarding them. This is a well-known way to degrade the performance of a network, however. A "do not fragment" bit was added to the IP header, telling network devices to drop packets rather than fragmenting them, and serious efforts were put into discovering the largest packet that can be transmitted through the network between any pair of devices. A newer generation of IP, called IPv6, removed fragmentation by network devices from the protocol specification.
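The fragment-or-drop behavior can be sketched as a tiny decision function. The function name and MTU values are invented, and the sketch is simplified (it ignores the header octets each real fragment must carry); in real IP, the drop also triggers an error message back to the sender, which is the basis of path MTU discovery.

```python
import math

# Hypothetical sketch of forwarding with a "do not fragment" (DF) bit:
# a too-large packet is either fragmented or, when DF is set, dropped.
def handle(packet_len, link_mtu, df_bit):
    """Return what a forwarding device does with the packet."""
    if packet_len <= link_mtu:
        return "forward"
    if df_bit:
        return "drop"                      # the sender must use smaller packets
    fragments = math.ceil(packet_len / link_mtu)
    return f"fragment into {fragments}"

print(handle(1400, 1500, df_bit=True))    # forward: fits the link untouched
print(handle(4000, 1500, df_bit=False))   # fragment into 3
print(handle(4000, 1500, df_bit=True))    # drop: DF forbids fragmentation
```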
Overlapping many of these previous discussions within the network engineering world was another issue that often made it more difficult to decide whether packet or circuit switching was the better solution. How should loop-free paths be computed in a packet switched network?
As packet switched networks have, throughout the history of network engineering, been associated with distributed control planes, and circuit switched networks have been associated with centralized control planes, the issue of computing loop-free paths efficiently had a major impact on deciding whether packet switched networks were viable or not.
In the early days of network engineering, the available processing power, memory, and bandwidth were often in short supply. Table 1-1 provides a little historical context.
Table 1-1 History of Computing Power, Memory, and Bandwidth
| Year | MIPS | Memory (Cost/MB) | Bandwidth (LAN) |
|------|------|------------------|-----------------|
| 1984 | 3.2 (Motorola 68010) | 1331 | 2Mb/s |
| 1987 | 6.6 (Motorola 68020) | 154 | 10Mb/s |
| 1990 | 44 (Motorola 68040) | 98 | 16Mb/s |
| 1996 | 541 (Intel Pentium Pro) | 8 | 100Mb/s |
| 1999 | 2,054 (Intel Pentium III) | 1 | 100Mb/s |
| 2006 | 49,161 (Intel Core 2, 4 cores) | 0.1 | 4Gb/s |
| 2014 | 238,310 (Intel i7, 4 cores) | 0.001 | 100Gb/s |
In 1984, when many of these discussions were occurring, any difference in the amount of processor and memory between two ways of calculating loop-free paths through a network would have a material impact on the cost of building a network. When bandwidth is at a premium, reducing the number of bits a control plane requires to transfer the information needed to calculate a set of loop-free paths through a network makes a real difference in the amount of user traffic the network can handle. Reducing the number of bits required for the control plane to operate also makes a large difference in the stability of the network at lower bandwidths.
For instance, using a Type Length Value (TLV) format to describe control plane information carried across the network adds a few octets of information to the overall packet length—but in the context of a 2Mbps link, aggravated by a chatty control plane, the costs could far outweigh the longer-term advantage of protocol extensibility.
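The overhead in question is concrete and small: a sketch of a minimal TLV encoding, with a 1-octet type and a 1-octet length per field (real protocols vary in the sizes of these fields), shows the tradeoff. The field contents here are invented placeholders.

```python
import struct

# Hypothetical TLV sketch: each value is prefixed with a 1-octet type and
# a 1-octet length, trading 2 octets of overhead per field for the ability
# to skip unknown fields or add new ones without changing a fixed layout.
def encode_tlv(type_code, value):
    return struct.pack("!BB", type_code, len(value)) + value

def decode_tlvs(data):
    fields, i = {}, 0
    while i < len(data):
        t, length = struct.unpack_from("!BB", data, i)
        fields[t] = data[i + 2 : i + 2 + length]
        i += 2 + length
    return fields

encoded = encode_tlv(1, b"metric") + encode_tlv(2, b"id")
print(len(encoded) - len(b"metric" + b"id"))  # 4 octets of TLV overhead
print(decode_tlvs(encoded))
```

A receiver that does not understand type 2 can still skip it using the length field, which is exactly the extensibility the text weighs against the per-field octet cost.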
Note
TLVs are discussed in Chapter 2, “Data Transport Problems and Solutions.”
The protocol wars were rather heated at some points; entire research projects were undertaken, and papers written, about why and how one protocol was better than another. As an example of the kind of back and forth these arguments generated, a shirt seen at the Internet Engineering Task Force (IETF) during which the Open Shortest Path First (OSPF) Protocol was being developed said: IS-IS = 0. The “IS-IS” here refers to Intermediate System-to-Intermediate System, a control plane (routing protocol) originally developed by the International Organization for Standardization (ISO).
There was a wide variety of mechanisms proposed to solve the problems of calculating loop-free paths through a network; ultimately three general classes of solutions have been widely deployed and used:
• Distance Vector protocols, which calculate loop-free paths hop by hop based on the path cost
• Link State protocols, which calculate loop-free paths across a database synchronized across the network devices
• Path Vector protocols, which calculate loop-free paths hop by hop based on a record of previous hops
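As a small concrete illustration of the first class, a distance vector computation in the style of Bellman-Ford can be sketched as follows; the three-node topology and link costs are invented for the example.

```python
# Hypothetical distance-vector sketch: each node repeatedly "advertises"
# its current best cost to every destination; a neighbor keeps any offer
# that improves on what it already knows (Bellman-Ford style).
links = {  # (node, neighbor): link cost -- an invented topology
    ("A", "B"): 1, ("B", "A"): 1,
    ("B", "C"): 2, ("C", "B"): 2,
    ("A", "C"): 5, ("C", "A"): 5,
}
nodes = {"A", "B", "C"}

# dist[n][d] = best known cost from n to d
dist = {n: {d: (0 if n == d else float("inf")) for d in nodes} for n in nodes}

changed = True
while changed:               # iterate until no advertisement changes anything
    changed = False
    for (n, neighbor), cost in links.items():
        for dest in nodes:
            offer = cost + dist[neighbor][dest]   # the neighbor's advertisement
            if offer < dist[n][dest]:
                dist[n][dest] = offer
                changed = True

print(dist["A"]["C"])  # 3: the A-B-C path beats the direct A-C link of cost 5
```

Each node computes only from its neighbors' advertisements, hop by hop, never seeing the whole topology; this locality is what distinguishes distance vector from link state protocols.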
The discussion over which protocol is best for each specific network, and for what particular reasons, still persists; it is probably a never-ending conversation, as there is (probably) no final answer to the question. Instead, as with fitting a network to a business, there will probably always be some degree of art (or craft) involved in making a particular control plane work on a particular network. Much of the urgency in the question, however, has been drawn out by the increasing speed of networks—in processing power, memory, and bandwidth.
As real-time traffic started to be carried over packet switched networks, QoS became a major problem. Voice and video both rely on the network being able to carry traffic between hosts quickly (having low delay), and with small amounts of variability in interpacket spacing (jitter). Discussions around QoS actually began in the early days of packet switched networking, but reached a high point around the time ATM was being considered. In fact, one of the main advantages of ATM was the ability to closely control the way in which packets were handled as they were carried over a packet switched network. With the failure of ATM in the market, two distinct lines of thought emerged about applications that require strong controls on jitter and delay:
• These applications would never work on packet switched networks; these kinds of applications would always need to be run on a separate network.
• It is just a matter of finding the right set of QoS controls to allow such applications to run on packet switched networks.
Note
Quality of Service is discussed in detail in Chapter 8, “Quality of Service.”
The primary application most providers and engineers were concerned about was voice, and the fundamental question came down to this: is it possible to provide decent voice over a network also carrying large file transfers and other "non-real-time" traffic? Complex schemes were invented to allow packets to be classified and marked (called QoS marking) so network devices would know how to handle them properly. Mapping systems were developed to carry these QoS markings from one type of network to another, and a lot of time and effort were put into researching queueing mechanisms—the order in which packets are sent out on an interface. Figure 1-7, a sample chart of one QoS system showing the mapping between applications and QoS markings, suffices to illustrate the complexity of these systems.
The increasing link speeds, shown previously in Table 1-1, had two effects on the discussion around QoS:
• Faster links will (obviously) carry more data. As any individual voice and video stream becomes a shrinking part of the overall bandwidth usage, the need to strongly balance the use of bandwidth between different applications became less important.
• The amount of time required to move a packet from memory onto the wire through a physical chip is reduced with each increase in bandwidth.
As available bandwidth increased, the need for complex queueing strategies to counter jitter became less important. This increase in speed has been augmented by newer queueing systems that are much more effective at managing different kinds of traffic, reducing the necessity of marking and handling traffic in a fine-grained fashion.
These increases in bandwidth were often enabled by changing from copper to glass fiber. Fiber not only offers larger bandwidths but also more reliable transmission of data. The way physical links are built also evolved, making them more resistant to breakage and other material problems. A second factor increasing bandwidth availability was the growth of the Internet. As networks became more common and more connected, a single link failure had a lesser impact on the amount of available bandwidth and on the traffic flows across the network.
As processors became faster, it became possible to develop systems where dropped and delayed packets would have less effect on the quality of a real-time stream. Increasing processor speeds also made it possible to use very effective compression algorithms, reducing the size of each stream. On the network side, faster processors meant the control plane could compute a set of loop-free paths through the network faster, reducing both direct and indirect impacts of link and device failures.
Ultimately, although QoS is still important, it can be much simplified. Four to six queues are often enough to support even the most difficult applications. If more are needed, some systems can now either engineer traffic flows through a network or actively manage queues, to balance between the complexity of queue management and application support.
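A strict-priority scheduler over a handful of queues, as described above, can be sketched in a few lines; the four-queue layout and packet names are invented for illustration, and real systems usually add weighting or policing so low-priority queues cannot be starved.

```python
from collections import deque

# Hypothetical sketch of a small queueing system: four queues, served in
# strict priority order -- queue 0 (e.g., voice) always drains first.
queues = [deque() for _ in range(4)]

def enqueue(priority, packet):
    queues[priority].append(packet)

def dequeue():
    """Return the next packet to transmit, highest priority first."""
    for q in queues:
        if q:
            return q.popleft()
    return None

enqueue(3, "bulk-1")     # file transfer
enqueue(0, "voice-1")    # real-time voice
enqueue(3, "bulk-2")
enqueue(1, "video-1")

print(dequeue())  # voice-1: drained before everything else
print(dequeue())  # video-1
print(dequeue())  # bulk-1
```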
In the 1990s, in order to resolve many of the perceived problems with packet switched networks, such as complex control planes and QoS management, researchers began working on a concept called Active Networking. The general idea was that the control plane for a packet switched network could, and should, be separated from the forwarding devices in order to allow the network to interact with the applications running on top of it.
The basic concept of separating the control and data planes more distinctly in packet switching networks was again considered in the formation of the Forwarding and Control Element Separation (ForCES) working group in the IETF. This working group was primarily concerned with creating an interface applications can use to install forwarding information onto network devices. The working group was eventually shut down in 2015 and its standards were never widely implemented.
In 2006, researchers began looking for a way to experiment with control planes in packet switched networks without the need to code modifications on the devices themselves—a particular problem, as most of these devices were sold by vendors as unmodifiable appliances (or black boxes). The eventual result was OpenFlow, a standard interface that allows applications to install entries directly in the forwarding table (rather than the routing table; this is explained more fully in several places in Part I of this book, “The Data Plane”). The research project was picked up as a feature by several vendors, and a wide array of controllers have been created by vendors and open source projects. Many engineers believed OpenFlow would revolutionize network engineering by centralizing the control plane.
The reality is likely to be far different—what is likely to happen is what has always happened in the world of data networking: the better parts of a centralized control plane will be consumed into existing systems, and the fully centralized model will fall by the wayside, leaving in its path changed ideas about how the control plane interacts with applications and the network at large.
The technologies described thus far—circuit and packet switching, control planes, and QoS—are very complex. In fact, there appears to be no end to the increasing complexity in networks, particularly as applications and businesses become more demanding. This section will consider two specific questions in relation to complexity and networks:
• What is network complexity?
• Can network complexity be “solved”?
The final parts of this section will consider a way of looking at complexity as a set of tradeoffs.
While the most obvious place to begin might be with a definition of complexity, it is actually more useful to consider why complexity is required in a more general sense. To put it more succinctly, is it possible to “solve” complexity? Why not just design simpler networks and protocols? Why does every attempt to make anything simpler in the networking world end up apparently making things more complex in the long run?
For instance, by tunneling on top of (or through) IP, the control plane’s complexity is reduced, and the network is made simpler overall. Why then do tunneled overlays end up containing so much complexity?
There are two answers to this question. First, human nature being what it is, engineers will always invent ten different ways to solve the same problem. This is especially true in the virtual world, where new solutions are (relatively) easy to deploy, it is (relatively) easy to find a problem with the last set of proposed solutions, and it is (relatively) easy to move some bits around to create a new solution that is “better than the old one.” This is particularly true from a vendor perspective, when building something new often means being able to sell an entirely new line of products and technologies—even if those technologies look very much like the old ones. The virtual space, in other words, is partially so messy because it is so easy to build something new there.
The second answer, however, lies in a more fundamental problem: complexity is necessary to deal with the uncertainty involved in difficult-to-solve problems. Figure 1-8 illustrates.
Adding complexity seems to allow a network to handle future requirements and unexpected events more easily, as well as provide more services over a smaller set of base functions. If this is the case, why not simply build a single protocol, running on a single network, that is able to handle all the requirements potentially thrown at it and any sequence of events you can imagine? A single network running a single protocol would certainly reduce the number of moving parts network engineers need to deal with, making all our lives simpler, right? In fact, there are a number of different ways to manage complexity, for instance:
1. Abstract the complexity away, to build a black box around each part of the system, so each piece and the interactions between these pieces are more immediately understandable.
2. Toss the complexity over the cubicle wall—to move the problem out of the networking realm into the realm of applications, or coding, or a protocol. As RFC1925 says, “It is easier to move a problem around (e.g., by moving the problem to a different part of the overall network architecture) than it is to solve it.”
3. Add another layer on top, to treat all the complexity as a black box by putting another protocol or tunnel on top of what’s already there. Returning to RFC1925, “It is always possible to add another level of indirection.”
4. Become overwhelmed with the complexity, label what exists as “legacy,” and chase some new shiny thing perceived to be able to solve all the problems in a much less complex way.
5. Ignore the problem and hope it will go away. A good example is arguing for an exception “just this once,” so a particular business goal can be met, or some problem fixed, within a very tight schedule, with the promise that the complexity issue will be dealt with “later.”
Each of these solutions, however, has a set of tradeoffs to consider and manage. Further, at some point, any complex system becomes brittle—robust yet fragile. A system is robust yet fragile when it is able to react resiliently to an expected set of circumstances, but an unexpected set of circumstances will cause it to fail. To give an example from the real world—knife blades are required to have a somewhat unique combination of characteristics. They must be hard enough to hold an edge and cut, and yet flexible enough to bend slightly in use, returning to their original shape without any evidence of damage, and they must not shatter when dropped. It has taken years of research and experience to find the right metal to make a knife blade from, and there are still long and deeply technical discussions about which material is right for specific properties, under what conditions, etc.
“Trying to make a network proof against predictable problems tends to make it fragile in dealing with unpredictable problems (through an ossification effect as you mentioned). Giving the same network the strongest possible ability to defend itself against unpredictable problems, it necessarily follows, means that it MUST NOT be too terribly robust against predictable problems. Not being too robust against predictable problems is necessary to avoid the ossification issue, but not necessarily sufficient to provide for a robust ability to handle unpredictable network problems.” —Tony Przygienda
Complexity is necessary, then: it cannot be “solved.”
Given complexity is necessary, engineers are going to need to learn to manage it in some way, by finding or building a model or framework. The best place to begin in building such a model is with the most fundamental question: What does complexity mean in terms of networks? Can you put a network on a scale and have the needle point to “complex”? Is there a mathematical model into which you can plug the configurations and topology of a set of network devices to produce a “complexity index”? How do the concepts of scale, resilience, brittleness, and elegance relate to complexity? The best place to begin in building a model is with an example.
What is network stretch? In the simplest terms possible, it is the difference between the shortest path in a network and the path that traffic between two points actually takes. Figure 1-9 illustrates this concept.
Assuming the cost of each link in this network is 1, the shortest physical path between Routers A and C will also be the shortest logical path: [A,B,C]. What happens, however, if the metric on the [A,B] link is changed to 3? The shortest physical path is still [A,B,C], but the shortest logical path is now [A,D,E,C]. The differential between the shortest physical path and the shortest logical path is the distance a packet being forwarded between Routers A and C must travel—in this case, the stretch can be calculated as (4 [A,D,E,C])−(3 [A,B,C]), for a stretch of 1.
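The calculation above can be sketched in a few lines of Python. This is a hypothetical illustration built on the topology of Figure 1-9, not code from the book: compute the shortest path twice, once by metric (the logical path) and once by hop count (the physical path), and take the difference in hop counts.

```python
import heapq

def shortest_path(graph, src, dst, weight=lambda metric: metric):
    """Dijkstra's algorithm. `weight` maps a link metric to the cost used
    for comparison; pass `lambda m: 1` to find the shortest *physical*
    path by hop count rather than the shortest logical path by metric."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (total + weight(metric), neighbor, path + [neighbor]))
    return None

# The topology of Figure 1-9, with the [A,B] metric raised to 3.
links = {("A", "B"): 3, ("B", "C"): 1, ("A", "D"): 1, ("D", "E"): 1, ("E", "C"): 1}
graph = {}
for (u, v), metric in links.items():
    graph.setdefault(u, {})[v] = metric
    graph.setdefault(v, {})[u] = metric

logical = shortest_path(graph, "A", "C")                       # by metric
physical = shortest_path(graph, "A", "C", weight=lambda m: 1)  # by hop count
stretch = (len(logical) - 1) - (len(physical) - 1)
print(logical, physical, stretch)  # ['A', 'D', 'E', 'C'] ['A', 'B', 'C'] 1
```

Counting in hops rather than node lists, the logical path is three hops, the physical path two, giving the same stretch of 1 as in the text.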
The way stretch is measured depends on what is most important in any given situation, but the most common way is by comparing hop counts through the network, as is used in the examples here. In some cases, it might be more important to consider the metric along two paths, the delay along two paths, or some other metric, but the important point is to measure it consistently across every possible path to allow for accurate comparison between paths.
It is sometimes difficult to differentiate between the physical topology and the logical topology. In this case, was the [A,B] link metric increased because the link is actually a slower link? If so, whether this is an example of stretch, or an example of simply bringing the logical topology in line with the physical topology is debatable.
In line with this observation, it is much easier to define policy in terms of stretch than almost any other way. Policy is any configuration that increases the stretch of a network. Using Policy-Based Routing, or Traffic Engineering, to push traffic off the shortest physical path and onto a longer logical path to reduce congestion on specific links, for instance, is a policy—it increases stretch.
Increasing stretch is not always a bad thing. Understanding the concept of stretch simply helps us understand various other concepts and put a framework around complexity and optimization tradeoffs. The shortest path, physically speaking, is not always the best path.
Stretch, in this illustration, is very simple—it impacts every destination, and every packet flowing through the network. In the real world, things are more complex. Stretch is actually per source/destination pair, making it very difficult to measure on a network-wide basis.
Three components—state, optimization, and surface—are common in virtually every network or protocol design decision. These can be seen as a set of tradeoffs, as illustrated in Figure 1-10 and described in the list that follows.
• Increasing optimization always moves toward more state or more interaction surfaces.
• Decreasing state always moves toward less optimization or more interaction surfaces.
• Decreasing interaction surfaces always moves toward less optimization or more state.
These are not ironclad rules, of course; they are contingent on the specific network, protocols, and requirements, but they hold true often enough to make this a useful model for understanding tradeoffs in complexity.
While state and optimization are fairly intuitive, it is worthwhile to spend just a moment more on interaction surfaces. The concept of interaction surfaces is difficult to grasp primarily because it covers such a wide array of ideas. Perhaps an example would be helpful; assume a function that
• Accepts two numbers as input
• Adds them
• Multiplies the resulting sum by 100
• Returns the result
This single function can be considered a subsystem in some larger system. Now assume you break this single function into two functions, one of which does the addition, and the other of which does the multiplication. You have created two simpler functions (each one only does one thing), but you have also created an interaction surface between the two functions—you have created two interacting subsystems within the system where there only used to be one.
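The function split described above can be made concrete with a small sketch (the names and numbers here are purely illustrative):

```python
def add_and_scale(a, b):
    """One subsystem: the addition and the scaling are hidden inside a
    single black box, with no internal interaction surface to manage."""
    return (a + b) * 100

# Splitting the work creates two simpler pieces, but also an interaction
# surface between them: `scale` now depends on exactly what `add` returns,
# and a change to either function can silently break the other.
def add(a, b):
    return a + b

def scale(total):
    return total * 100

assert add_and_scale(2, 3) == scale(add(2, 3)) == 500
```

Each piece alone is easier to reason about; the cost is the new contract between them, which must now be documented, tested, and kept stable.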
As another example, assume you have two control planes running on a single network. One of these two control planes carries information about destinations reachable outside the network (external routes), while the other carries destinations reachable inside the network (internal routes). While these two control planes are different systems, they will still interact in many interesting and complex ways. For instance, the reachability to an external destination will necessarily depend on reachability to the internal destinations between the edges of the network. These two control planes must now work together to build a complete table of information that can be used to forward packets through the network.
Even two routers communicating within a single control plane can be considered an interaction surface. This breadth of definition is what makes it so very difficult to define what an interaction surface is.
Interaction surfaces are not a bad thing; they help engineers and designers divide and conquer in any given problem space, from modeling to implementation. At the same time, interaction surfaces are all too easy to introduce without thought.
The wasp waist, or hourglass model, is used throughout the natural world, and widely mimicked in the engineering world. While engineers do not often consciously apply this model, it is actually used all the time. Figure 1-11 illustrates the hourglass model in the context of the four-layer Department of Defense (DoD) model that gave rise to the Internet Protocol (IP) suite.
At the bottom layer, the physical transport system, there is a wide array of protocols, from Ethernet to satellite. At the top layer, where information is marshaled and presented to applications, there is a wide array of protocols, from Hypertext Transfer Protocol (HTTP) to TELNET (and thousands of others besides). A funny thing happens when you move toward the middle of the stack, however: the number of protocols decreases, creating an hourglass. Why does this work to control complexity? Going back through the three components of complexity—state, surface, and optimization—exposes the relationship between the hourglass and complexity.
• State is divided by the hourglass into two distinct types: information about the network and information about the data being transported across the network. While the upper layers are concerned with marshaling and presenting information in a usable way, the lower layers are concerned with discovering what connectivity exists and what the connectivity properties actually are. The lower layers do not need to know how to format an FTP frame, and the upper layers do not need to know how to carry a packet over Ethernet—state is reduced at both ends of the model.
• Surfaces are controlled by reducing the number of interaction points between the various components to precisely one—the Internet Protocol (IP). This single interaction point can be well defined through a standards process, with changes in the one interaction point closely regulated to prevent massive rapid changes that will reflect up and down the protocol stack.
• Optimization is traded off by allowing one layer to reach into another layer, and by hiding the state of the network from the applications. For instance, TCP does not really know the state of the network other than what it can gather from local information. TCP could potentially be much more efficient in its use of network resources, but only at the cost of a layer violation, which opens up difficult-to-control interaction surfaces.
The layering of a stacked network model is, then, a direct attempt to control the complexity of the various interacting components of a network.
This chapter is not intended to provide detail, but rather to frame key terms within the scope of the history of computer network technology. The computer networking world does not have a long history (for example, human history reaches back at least 6,000 years, and potentially many millions, depending on your point of view), but this history still contains a set of switchback turns and bumpy pathways, often making it difficult for the average person to understand how and why things work the way they do.
With this introduction in hand, it is time to turn to the first topic of interest in understanding how networks really work—the data plane.
Brewer, Eric. “Towards Robust Distributed Systems.” Presented at the ACM Symposium on the Principles of Distributed Computing, July 19, 2000. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf.
Buckwalter, Jeff T. Frame Relay: Technology and Practice. 1st edition. Reading, MA: Addison-Wesley Professional, 1999.
Cerf, Vinton G., and Edward Cain. “The DoD Internet Architecture Model.” Computer Networks 7 (1983): 307–18.
Gorrell, Mike. “Salt Lake County Data Breach Exposed Info of 14,200 People.” The Salt Lake Tribune. Accessed April 23, 2017. http://www.sltrib.com/home/3705923-155/data-breach-exposed-info-of-14200.
Ibe, Oliver C. Converged Network Architectures: Delivering Voice over IP, ATM, and Frame Relay. 1st edition. New York: Wiley, 2001.
Kumar, Balaji. Broadband Communications: A Professional’s Guide to ATM, Frame Relay, SMDS, SONET, and BISDN. New York: McGraw-Hill, 1995.
“LAN Emulation.” Microsoft TechNet. Accessed August 4, 2017. https://technet.microsoft.com/en-us/library/cc976969.aspx.
“LAN Emulation (LANE).” Cisco. Accessed August 4, 2017. http://www.cisco.com/c/en/us/tech/asynchronous-transfer-mode-atm/lan-emulation-lane/index.html.
Padlipsky, Michael A. The Elements of Networking Style and Other Essays and Animadversions on the Art of Intercomputer Networking. Prentice-Hall, 1985.
Russell, Andrew L. “OSI: The Internet That Wasn’t.” IEEE Spectrum, September 27, 2016. https://spectrum.ieee.org/tech-history/cyberspace/osi-the-internet-that-wasnt.
“Understanding the CBR Service Category for ATM VCs.” Cisco. Accessed June 10, 2017. http://www.cisco.com/c/en/us/support/docs/asynchronous-transfer-mode-atm/atm-traffic-management/10422-cbr.html.
White, Russ, and Jeff Tantsura. Navigating Network Complexity: Next-Generation Routing with SDN, Service Virtualization, and Service Chaining. Indianapolis, IN: Addison-Wesley Professional, 2015.
1. One specific realm where different business assumptions can be clearly seen is in choosing to use a small number of large network devices (such as a chassis-based router that supports multiple line cards) or using a larger number of smaller devices (so-called pizza box, or one rack unit, routers having a fixed number of interfaces available) to build a campus or data center network. List a number of different factors that might make one option more expensive than the other, and then explain what sorts of business conditions might dictate the use of one instead of the other for both options.
2. One “outside representation” of code bloat in software applications is nerd knobs; while there are many definitions of a nerd knob, they are generally considered a configuration command that will modify some small, specific, point of operation in the way a protocol or device operates. There are actually some research papers and online discussions around the harm from nerd knobs; you can also find command sets from various network devices across a number of software releases through many years. In order to see the growth in complexity in network devices, trace the number of available commands, and try to judge how many of these would be considered nerd knobs versus major features. Is there anything you can glean from this information?
3. TDM is not the only kind of multiplexing available; there is also Frequency Division Multiplexing (FDM). Would FDM be useful for dividing a channel in the same way that TDM is? Why or why not?
4. What is an inverse multiplexer, and what would it be used for?
5. Read the two references to ATM LAN Emulation (LANE) in the “Further Reading” section. Describe the complexity in this solution from within the complexity model; where are state and interaction surfaces added, and what sort of optimization is being gained with each addition? Do you think the ATM LANE solution presents a good set of tradeoffs for providing the kinds of services it is designed to offer versus something like a shared Ethernet network?
6. Describe, in human terms, why delay and jitter are bad in real time (interactive) voice and video communications. Would these same problems apply to recorded voice and video stored and played back at some later time? Why or why not?
7. How would real-time (interactive) voice and video use the network differently than a large file transfer? Are there specific points at which you can compare the two kinds of traffic, and describe how the network might need to react differently to each traffic type?
8. The text claims the “wasp waist” is a common strategy used in nature to manage complexity. Find several examples in nature. Research at least one other set of protocols (protocol stack) than TCP/IP, such as Banyan Vines, Novell’s IPX, or the OSI system. Is there a “wasp waist” in these sets of protocols, as well? What is it?
9. Are there wasp waists in other areas of computing, such as the operating systems used in personal computers, or mobile computing devices (such as tablets and mobile phones)? Can you identify them?
10. Research some of the arguments against removing fragmentation from the Internet Protocol in IPv6. Summarize the points made by each side. Do you agree with the final decision to remove fragmentation?
1. Padlipsky, The Elements of Networking Style and Other Essays and Animadversions on the Art of Intercomputer Networking (New York: Prentice-Hall, 1985).
When transport protocols dream, do they dream of applications? They probably should, as the primary purpose of a network is to support applications—and the primary resource that applications need from a network is data moved from one process (or processor) to another. But how can data be transmitted over a wire, or through the air, or over an optical cable?
Perhaps it is best to begin with a more familiar example: human language. The authors of this book wrote it using formatting, language, and vocabulary enabling you to read and understand the information presented. What problems does a language need to overcome to make the communication, this writing and reading, possible?
Thoughts must be captured in a form that allows them to be retrieved by a receiver. In human languages, information is packaged into words, sentences, paragraphs, chapters, and books. Each level of this division implies some unit of information and some organizational system. For instance, sounds or ideas are encapsulated into letters or symbols; sounds or ideas are then combined into words; words are combined into sentences; and so on. Sentences follow a particular grammatical form so you can decode the meaning from the symbols. This encoding of thoughts and information into formatted symbols, in a way that allows a reader (receiver) to retrieve the original meaning, will be called marshaling the data in this book.
One aspect of marshaling is definitional—the process of associating one set of symbols to a particular meaning. Metadata, or data about the data, allows you to understand how to interpret information in a flow or stream.
There must be some way of managing errors in transmission or reception. Suppose you have a pet dog who likes to chase after a particular ball. The ball drops out of a basket one day and bounces into the street. The dog chases it and appears to be heading directly into the path of an oncoming car. What do you do? Perhaps you shout “Stop!”—and then maybe “No!”—and perhaps “Stay!” Using several commands that should all result in the same action—the dog stopping before he runs into the street—is a way of making certain the dog has correctly received, and understood, the message. Shouting multiple messages will, you hope, ensure there is no misunderstanding in what you are telling the dog to do.
This is, in fact, a form of error correction. There are many kinds of error correction built into human language. For instance, yu cn prbbly stll rd ths sntnce. Human languages overspecify the information they contain, so a few missed letters do not cause the entire message to be lost. This overspecification can be considered a form of forward error correction. This is not the only form of error correction human languages contain, however. They also contain questions, which can be asked to verify, validate, or gain missing bits or context of information previously “transmitted” through the language.
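This kind of overspecification can be sketched as a simple repetition code, a generic forward error correction illustration rather than any particular protocol: the sender transmits each bit several times, and the receiver takes a majority vote, so a single corrupted symbol does not destroy the message.

```python
from collections import Counter

def encode(bits, copies=3):
    """Repetition code: overspecify by sending each bit `copies` times."""
    return [b for b in bits for _ in range(copies)]

def decode(received, copies=3):
    """Majority vote per group recovers the message despite some flips."""
    out = []
    for i in range(0, len(received), copies):
        group = received[i:i + copies]
        out.append(Counter(group).most_common(1)[0][0])
    return out

msg = [1, 0, 1, 1]
sent = encode(msg)          # 12 symbols on the wire for 4 bits of data
sent[4] = 1 - sent[4]       # a single transmission error
assert decode(sent) == msg  # the receiver still recovers the message
```

The cost of this resilience is the same as in human language: the channel carries three times as many symbols as the message strictly requires.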
There must be some way to talk to one person, or a small group of people, using a single medium—air—within a larger crowd. It is not uncommon to need to talk to one person out of a room full of people. Human language has built in ways of dealing with this problem in many situations, such as calling someone’s name, or speaking loudly enough to be heard by the person you are directly facing (the implementation of language can be directional, in other words). The ability to speak to one person among many, or a specific subset of people, is multiplexing.
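A minimal sketch of this idea (the names here are made up for illustration): many conversations share one medium, and an address on each message lets a receiver pick out only what is meant for it.

```python
# One shared medium carrying several interleaved conversations; the
# address attached to each message is what makes multiplexing possible.
medium = [
    ("alice", "hello"),
    ("bob", "stop!"),
    ("alice", "how are you?"),
]

def receive(medium, me):
    """Filter the shared medium down to one receiver's conversation."""
    return [payload for address, payload in medium if address == me]

print(receive(medium, "alice"))  # ['hello', 'how are you?']
```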
Finally, there must be some way to control the flow of a conversation. With a book, this is a simple matter; the writer produces text in parts, which are then collected into a format the reader can read, and reread, at a completely different pace. Not many people think of a book as a form of flow control, but putting thoughts into written form is an effective way to disconnect the speed of the sender (the speed of writing) from the speed of the receiver (the speed of reading). Spoken language has other forms of flow control, such as “um,” and the glazed-over look in a listener’s eyes when she has lost the line of reasoning a speaker is following, or even physical gestures indicating the speaker should slow down.
To summarize, successful communication systems need to solve four problems:
• Marshaling the data; converting ideas into symbols and a grammar the receiver will understand
• Managing errors, so the ideas are correctly transmitted from the sender to the receiver
• Multiplexing, or allowing a common media or infrastructure to be used for conversations between many different pairs of senders and receivers
• Flow control, or the ability to make certain the receiver is actually receiving and processing the information before the sender transmits more
The following sections examine each of these problems as well as some of the solutions available in each problem space.
Consider the process you are using to read this book. You examine a set of marks created to contrast with a physical carrier, ink on paper. These marks represent certain symbols (or, if you are hearing this book, certain sounds on a white noise background), which you then interpret as letters. These letters, in turn, you can put together using rules of spacing and layout to form words. Words, through punctuation and spacing, you can form into sentences.
At each stage in the process there are several kinds of things interacting:
• A physical carrier onto which the signal can be imposed. This work of representing information against a carrier is grounded in the work of Claude Shannon, and is outside the scope of this book; further reading is suggested in the following section for those who are interested.
• A symbolic representation of units of information used to translate the physical symbols into the first layer of logical content. When you are interpreting symbols, two things are required: a dictionary, which describes the range of possible logical symbols that can correspond to a certain physical state, and a grammar, which describes how to determine which logical symbol relates to this instance of physical state. These two things, combined, can be described as a protocol.
• A way to convert the symbols into words and then the words into sentences. Again, this will consist of two components, a dictionary and a grammar. Again, these can be described as protocols.
As you move “up the stack,” from the physical to the letters to the words to the sentences, etc., the dictionary will become less important, and the grammar, which allows you to convert the context into meaning, more important—but these two things exist at every layer of the reading and/or listening process. The dictionary and grammar are considered two different forms of metadata you can use to turn physical representations into sentences, thoughts, lines of argument, etc.
There really is not much difference between a human language, such as the one you are reading right now, and a digital language. A digital language is not called a language, however; it is called a protocol. More formally:
A protocol is a dictionary and a grammar (metadata) used to translate one kind of information into another.
Protocols do not work in just one direction, of course; they can be used to encode as well as decode information. Languages are probably the most common form of protocol you encounter on a daily basis, but there are many others, such as traffic signs; the user interfaces on your toaster, computer, and mobile devices; and every human language.
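The definition above (a dictionary plus a grammar, working in both directions) can be sketched in a few lines of code. The dictionary and the terminator rule below are invented purely for illustration; they do not correspond to any real protocol.

```python
# A toy protocol: the dictionary maps symbols to numeric codes, and the
# grammar says a message is a sequence of codes terminated by the code 0.
DICTIONARY = {"H": 1, "I": 2}                     # hypothetical dictionary
REVERSE = {code: symbol for symbol, code in DICTIONARY.items()}

def encode(text):
    """Apply the dictionary, then the grammar (append the terminator)."""
    return [DICTIONARY[symbol] for symbol in text] + [0]

def decode(codes):
    """Apply the same dictionary and grammar in the other direction."""
    symbols = []
    for code in codes:
        if code == 0:                             # grammar: 0 ends the message
            break
        symbols.append(REVERSE[code])
    return "".join(symbols)

print(encode("HI"))          # [1, 2, 0]
print(decode([1, 2, 0]))     # HI
```

The same two pieces of metadata drive both directions, which is the point made above: a protocol encodes as well as decodes.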
Given you are developing a protocol, which primarily means developing a dictionary and a grammar, there are two kinds of optimization you can work toward:
• Resource Efficiency. How many resources are used in encoding any particular bit of information? The more metadata carried inline, with the data itself, the less efficient the encoding will be, but the less implementations will need to rely on external dictionaries to decode the information. Protocols that use very small signals to encode a lot of information are generally considered compact.
• Flexibility. In the real world, things change. Protocols must somehow be designed to deal with change, hopefully in a way not requiring a “flag day” to upgrade the protocol.
The metadata tradeoff is one of many you will find in network engineering; either include more metadata, allowing the protocol to better handle future requirements, or include less metadata, making the protocol more efficient and compact. A good rule of thumb, one you will see repeated many times throughout this book, is: if you have not found the tradeoff, you have not looked hard enough.
A dictionary in a protocol is a table of digital patterns to symbols and operations. Perhaps the most commonly used digital dictionaries are character codes. Table 2-1 replicates part of the Unicode character code dictionary.
Table 2-1 A Partial Unicode Dictionary or Table

Code   | Glyph | Decimal | Description    | #
-------|-------|---------|----------------|-----
U+0030 | 0     | 0       | Digit Zero     | 0017
U+0031 | 1     | 1       | Digit One      | 0018
U+0032 | 2     | 2       | Digit Two      | 0019
U+0033 | 3     | 3       | Digit Three    | 0020
U+0034 | 4     | 4       | Digit Four     | 0021
U+0035 | 5     | 5       | Digit Five     | 0022
U+0036 | 6     | 6       | Digit Six      | 0023
U+0037 | 7     | 7       | Digit Seven    | 0024
U+0038 | 8     | 8       | Digit Eight    | 0025
U+0039 | 9     | 9       | Digit Nine     | 0026
U+003A | :     | :       | Colon          | 0027
U+003B | ;     | ;       | Semicolon      | 0028
U+003C | <     | <       | Less-than sign | 0029
Using Table 2-1, if a computer is “reading” an array representing a series of letters, it will print out (or treat in processing) the number 6 if the number in the array is 0023, the number 7 if the number in the array is 0024, etc. This table, or dictionary, relates specific numbers to specific symbols in an alphabet, just like a dictionary relates a word to a range of meanings.
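The lookup the computer performs can be modeled as a literal table in code. The mapping below reproduces two rows of Table 2-1, with the numbers from the # column treated as opaque strings; it is an illustrative fragment, not a complete Unicode dictionary.

```python
# A fragment of the dictionary in Table 2-1: stored number -> glyph.
GLYPHS = {"0023": "6", "0024": "7"}

def render(array):
    """Translate an array of stored numbers into the glyphs they stand for."""
    return "".join(GLYPHS[number] for number in array)

print(render(["0023", "0024"]))  # 67
```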
How can the computer determine the difference between the price of a banana and the letters in the word banana? Through the context of the information. For instance, perhaps the array in question is stored as a string, or a series of letters; the array being stored as a string variable type provides the metadata, or the context, which indicates the values in these particular memory locations should be treated as letters rather than the numeric values contained in the array. This metadata, acted on by the computer, provides the grammar of the protocol.
In protocols, dictionaries are often expressed in terms of what any particular field in a packet contains, and grammars are often expressed in terms of how the packet is built, or what fields are contained at what locations in a packet.
There are several ways to build dictionaries and basic (first-level) grammars; several of these will be considered in the following sections.
Fixed length fields are the simplest of the dictionary mechanisms to explain. The protocol defines a set of fields, what kind of data each field contains, and how large each field is. This information is “baked into” the protocol definition, so every implementation is built to these same specifications, and hence can interoperate with one another. Figure 2-1 illustrates a fixed length field encoding used in the Open Shortest Path First (OSPF) protocol, taken from RFC 2328.
The row of numbers across the top of Figure 2-1 indicates the individual bits in the packet format; each row contains 32 bits of information. The first 8 bits indicate the version number, the second 8 bits always have the number 5, the following 16 bits contain the total packet length, etc. Each of these fields is further defined in the protocol specification with the kind of information carried in the field and how it is encoded. For instance:
• The version number field is encoded as an unsigned integer. This is metadata indicating the dictionary and grammar used for this packet. If the packet format needs to be changed, the version number can be increased, allowing transmitters and receivers to use the correct dictionary and grammar when encoding and decoding the information in the packet.
• The number 5 indicates the kind of packet within the protocol; this is part of a dictionary defined elsewhere in the standards document, so it is simply inserted as a fixed value in this illustration. This particular packet is a Link State Acknowledgment Packet.
• The packet length is encoded as an unsigned integer indicating the number of octets (or sets of 8 bits) contained in the complete packet. This allows the packet size to vary in length depending on how much information needs to be carried.
The fixed length field format has several advantages. Primarily, the location of any piece of information within the packet will be the same from packet to packet, which means it is easy to optimize the code designed to encode and decode the information around the packet format. For instance, a common way of processing a fixed length packet format is to create an in-memory data structure matching the packet format precisely; when the packet is read off the wire, it is simply copied into this data structure. The fields within the packet can then be read directly.
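Python's struct module shows this copy-and-read pattern concisely. The layout below follows the first four octets of the OSPF header described above (an 8-bit version, an 8-bit type, and a 16-bit packet length); the remaining header fields are omitted, and the length value is invented for the example.

```python
import struct

# Network byte order: one octet version, one octet type, two octets of length.
HEADER = "!BBH"

# Transmit side: pack version 2, packet type 5 (Link State Acknowledgment),
# and a hypothetical total length of 44 octets into four octets on the wire.
wire = struct.pack(HEADER, 2, 5, 44)

# Receive side: because every field sits at a fixed, known offset, the bytes
# can be "copied into the structure" in a single unpack call.
version, packet_type, length = struct.unpack(HEADER, wire)
print(version, packet_type, length)  # 2 5 44
```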
Fixed length formats tend to be somewhat compact. The metadata needed to encode and decode the data is carried “outside the protocol,” in the form of a protocol specification. The packets themselves contain only the values, and never any information about the values. On the other hand, fixed length formats can waste a lot of space by padding fields so they are always the same length. For instance, the decimal number 1 can be represented with a single binary digit (a single bit), while the decimal number 4 requires 3 binary digits (three bits); if a fixed length field must be able to represent any number between 0 and 4, it will need to be at least 3 bits long, even though two of those bits will sometimes be “wasted” in representing smaller decimal numbers.
Fixed length formats also often waste space by aligning the field sizes on common processor memory boundaries to improve the speed of processing. A field required to take values between 0 and 3, even though it only needs two bits to represent the full set of values, may be encoded as an 8-bit field (a full octet) in order to ensure the field following is always aligned on an octet boundary for faster in-memory processing.
Flexibility is where fixed length encoding often runs into problems. If some field is defined as an 8-bit value (a single octet) in the original specification, there is no obvious way to modify the length of the field to support new requirements. The primary way this problem is solved in fixed length encoding schemes is through the version number. If the length of a field must be changed, the version number is modified in packet formats supporting the new field length. This allows implementations to use the old format until all the devices in the network are upgraded to support the new format; once they are all upgraded, the entire system can be switched to the new format, whether larger or smaller.
The Type Length Value (TLV) format is another widely used solution to the problem of marshaling data. Figure 2-2 shows an example from the Intermediate System to Intermediate System (IS-IS) routing protocol.
In Figure 2-2, a packet consists of a header, which is normally fixed length, and then a set of TLVs. Each TLV is formatted based on its type code. In this case, there are two TLV types shown (there are many other types in IS-IS; two are used for illustration here). The first type is a 135, which carries Internet Protocol version 4 (IPv4) information. This type has several fields, some of which are fixed length—such as the metric. Others, however, such as the prefix, are variable length; the length of the field depends on the value placed in some other field within the TLV. In this case, the prefix length field determines the length of the prefix field. There are also subTLVs, which are similarly formatted, and carry information associated with this IPv4 information. The type 236 is similar to the 135, but it carries IPv6, rather than IPv4, information.
Essentially, the TLV can be considered a complete set of self-contained information carried within a larger packet. The TLV consists of three parts:
• The type code, which describes the format of the data
• The length, which describes the total length of the data
• The value, or the data itself
TLV-based formats are less compact than fixed length formats because they carry more metadata within the packet itself. The type and length information carried in the data provides the information about where to look in the dictionary for information about the formatting, as well as information about the grammar to use (how each field is formatted, etc.). TLV formats trade off the ability to change the formatting of the information being carried by the protocol without requiring every device to upgrade, or allowing some implementations to choose not to support every possible TLV, against the additional metadata carried across the wire.
TLVs are generally considered a very flexible way of marshaling data in protocols; you will find this concept to be almost ubiquitous.
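A minimal TLV walker makes the three parts concrete. The type codes 135 and 236 are taken from the IS-IS example above, but the values are reduced to opaque byte strings here; real IS-IS TLVs carry further internal structure (and subTLVs).

```python
import struct

def encode_tlv(tlv_type, value):
    """One-octet type, one-octet length, then the value itself."""
    return struct.pack("!BB", tlv_type, len(value)) + value

def parse_tlvs(buffer):
    """Walk back-to-back TLVs in a buffer, yielding (type, value) pairs."""
    offset = 0
    while offset < len(buffer):
        tlv_type, length = struct.unpack_from("!BB", buffer, offset)
        yield tlv_type, buffer[offset + 2 : offset + 2 + length]
        offset += 2 + length  # the length field says where the next TLV starts

packet = encode_tlv(135, b"ipv4-info") + encode_tlv(236, b"ipv6-info")
for tlv_type, value in parse_tlvs(packet):
    print(tlv_type, value)
```

Because each TLV announces its own length, a receiver that does not understand a given type code can still skip over it cleanly, which is a large part of the flexibility described above.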
One of the major problems with fixed length fields is the fixedness of the field definitions; if you want to modify a fixed length field protocol, you need to bump the version number and modify the packet, or you must create a new packet type with different encodings for the fields. TLV formatting solves this by including metadata inline, with the data being transmitted, at the cost of carrying more information and reducing compactness. Shared compiled dictionaries attempt to solve this problem by placing the dictionary in a sharable file (or library) rather than in a specification. Figure 2-3 illustrates the process.
In Figure 2-3, the process begins with a developer building a data structure to marshal some particular set of data to be transferred across the network. Once the data structure has been built, it is compiled into a function, or perhaps copied into a library of functions (1), and copied over to the receiver (2). The receiver then uses this library to write an application to process this data (3). On the transmitter side, the raw data is encoded into the format (4), and then carried by a protocol across the network to the receiver (5). The receiver uses its shared copy of the data format (6) to decode the data, and pass the decoded information to the receiving application (7).
This kind of system combines the flexibility of the TLV-based model with the compactness of a fixed field protocol. While the fields are fixed length, the field definitions are given in a way that allows for fast, flexible updates when the marshaling format needs to be changed. So long as the shared library is decoupled from the application using the data, the dictionary and grammar can be changed by distributing a new version of the original data structure.
Would a flag day be required if a new version of the data structure is distributed? Not necessarily. If a version number is included in the data structure, so the receiver could match the received data with the correct data structure, then multiple versions of the data structure could exist in the system at one time. Once no sender is found using an older data format, the older structure can be safely discarded throughout the entire system.
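A sketch of that versioning idea: key the shared dictionary by version number, and let the receiver pick the layout matching each incoming message. The field layouts below are invented for illustration and belong to no real system.

```python
import struct

# Hypothetical shared dictionary: version number -> fixed field layout.
FORMATS = {
    1: "!BH",   # version 1: an 8-bit field followed by a 16-bit field
    2: "!BI",   # version 2: the second field was widened to 32 bits
}

def decode(version, payload):
    """Look up the sender's version and unpack with the matching layout."""
    return struct.unpack(FORMATS[version], payload)

# Senders at different versions can coexist; no flag day is required.
print(decode(1, struct.pack("!BH", 10, 24)))  # (10, 24)
print(decode(2, struct.pack("!BI", 10, 24)))  # (10, 24)
```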
Note
gRPC is an example of a compiled shared library marshaling system; see the “Further Reading” section for resources.
Note
While the fixed format and TLV systems count on the developers reading the specifications, and writing code as a form of sharing the grammar and dictionary, shared data structure systems, as described in this section, count on the shared dictionary being distributed in some other way. There are many different ways this could be done; for instance, a new version of software can be distributed to all the senders and receivers, or some form of distributed database can be used to ensure all the senders and receivers receive the updated data dictionaries, or some part of an application that specifically manages marshaling data can be distributed and paired with an application that generates and consumes the data. Some systems of this kind transfer the shared dictionary as part of their initial session setup. All of these are possible, and outside the scope of this present text.
No data transmission medium can be assumed to be perfect. If the transmission medium is shared, like Radio Frequency (RF), there is the possibility of interference, or even datagram collisions. This is where more than one sender attempts to transmit information simultaneously. The result is a garbled message that cannot be understood by the intended receiver. Even a dedicated medium, such as a point-to-point undersea optical (lightwave) fiber cable, can experience errors due to cable degradation or point events—even seemingly insane events, such as solar flares causing radiation, which in turn interferes with data transmission through a copper cable.
There are two key questions a network transport must answer in the area of errors:
• How can errors in the transmission of data be detected?
• What should the network do about errors in data transmission?
The following sections consider some of the possible answers to these questions.
The first step in dealing with errors, whether they are because of a transmission media failure, memory corruption in a switching device along the path, or any other reason, is to detect the error. The problem is, of course, when a receiver examines the data it receives, there is nothing to compare the data to in order to detect the error.
Parity checks are the simplest detection mechanisms. Two complementary parity checking algorithms exist. With even parity checking, one additional bit is added to each block of data. If the sum of bits in the block of data is even—that is, if there are an even number of 1 bits in the data block—the additional bit is set to 0. This preserves the even parity state of the block. If the sum of bits is odd, the additional bit is set to 1, which sets the entire block to an even parity state. Odd parity uses the same additional bit strategy, but it requires the block to have odd parity (an odd number of 1 bits).
As an example, calculate even and odd parity for these four octets of data:
00110011 00111000 00110101 00110001
Simply counting the digits reveals there are 14 1s and 18 0s in this data. To provide for error detection using a parity check, you add one bit to the data, either making the total number of 1s in the newly enlarged set of bits even for even parity, or odd for odd parity. For instance, if you want to add an even parity bit in this case, the additional bit should be set to 0. This is because the number of 1s is already an even number. Setting the additional parity bit to 0 will not add another 1, and hence will not change whether the total number of 1s is even or odd. For even parity, then, the final set of bits is
00110011 00111000 00110101 00110001 0
On the other hand, if you wanted to add a single bit of odd parity to this set of bits, you would need to make the additional parity bit a 1, so there are now 15 1s rather than 14. For odd parity, the final set of bits is
00110011 00111000 00110101 00110001 1
To check whether or not the data has been corrupted or changed in transit, the receiver can simply note whether even or odd parity is in use, add up the number of 1s, and discard the parity bit. If the number of 1s does not match the kind of parity in use (even or odd), the data has been corrupted; otherwise, the data appears to be the same as what was originally transmitted.
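Computed in code, the parity check is a single modulo operation. This sketch uses the four octets from the example above with even parity, and also demonstrates the limitation discussed below: flipping an even number of bits slips past the check.

```python
def even_parity_bit(bits):
    """Return the bit that makes the total number of 1s even."""
    return sum(bits) % 2

def check_even_parity(bits_with_parity):
    """Valid when the 1s, including the parity bit, sum to an even number."""
    return sum(bits_with_parity) % 2 == 0

data = [int(b) for b in "00110011" "00111000" "00110101" "00110001"]
parity = even_parity_bit(data)              # 14 ones, so the parity bit is 0
print(check_even_parity(data + [parity]))   # True

# Flip an even number of bits and the check is fooled, as described below:
corrupted = list(data)
corrupted[7] ^= 1
corrupted[31] ^= 1
print(check_even_parity(corrupted + [parity]))  # True, though two bits changed
```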
This new bit is, of course, transmitted along with the original bits. What happens if the parity bit itself is somehow corrupted? This is actually okay; assume even parity checking is in place, and a transmitter sends
00110011 00111000 00110101 00110001 0
The receiver, however, receives
00110011 00111000 00110101 00110001 1
The parity bit itself has been flipped from a 0 to a 1. The receiver will count the 1s, determining there are 15; since even parity checking is in use, the received data will be flagged as having an error even though it does not. The parity check is potentially too sensitive to failures, but it is better to err on the side of caution in the case of error detection.
There is one problem with the parity check: it can detect only a single bit flip in the transmitted signal. For instance, if even parity is in use, and the transmitter sends
00110011 00111000 00110101 00110001 0
The receiver, however, receives
00110010 00111000 00110101 00110000 0
The receiver will count the number of 1s and find it is 12; since the system is using even parity, the receiver will assume the data is correct and process it normally. However, two bits (the final bit of the first octet and the final bit of the fourth octet) have both been corrupted. If an even number of bits, in any combination, is modified, the parity check cannot detect the change; only when the change involves an odd number of bits can the parity check detect the modification of the data.
The Cyclic Redundancy Check (CRC) can detect a wider range of modifications in transmitted data by using division (rather than addition) in cycles across the entire data set, one small piece at a time. Working through an example is the best way to understand how a CRC is calculated. A CRC calculation begins with a polynomial, as shown in Figure 2-4.
In Figure 2-4, a three-term polynomial, x^3 + x^2 + 1, is expanded to include all the terms, including the term with a 0 coefficient (which does not impact the result of the calculation regardless of the value of x). The four coefficients, 1101, are then used as a binary divisor for calculating the CRC.
To perform the CRC, begin with the original binary data set, and add three extra bits (because the original polynomial, without the coefficients, has three terms; hence this is called a three-bit CRC check), as shown here:
10110011 00111001 (original data)
10110011 00111001 000 (with the added CRC bits)
These three bits are required to ensure all the bits in the original data are included in the CRC; as the CRC moves from left to right across the original data, the last bits in the original data will be included only if these padding bits are included. Now begin at the left four bits (because the four coefficients are represented as four bits). Use the Exclusive OR (XOR) operation to compare the far-left bits against the CRC bits, and save the result, as shown here:
10110011 00111001 000 (padded data)
1101 (CRC check bits)
----
01100011 00111001 000 (result of the XOR)
XOR’ing two binary digits results in a 0 if the two digits match, and a 1 if they do not.
The check bits, called a divisor, are moved one bit to the right (some steps can be skipped here) and the operation is repeated until the end of the number is reached:
10110011 00111001 000
1101
01100011 00111001 000
 1101
00001011 00111001 000
    1101
00000110 00111001 000
     110 1
00000000 10111001 000
         1101
00000000 01101001 000
          1101
00000000 00000001 000
                1 101
00000000 00000000 101
The CRC is in the final three bits that were originally added on as padding; this is the “remainder” of the division process of moving across the original data plus the original padding. It is simple for the receiver to determine whether the data has been changed by leaving the CRC bits in place (101 in this case), and using the original divisor across the data, as shown here:
10110011 00111001 101
1101
01100011 00111001 101
 1101
00001011 00111001 101
    1101
00000110 00111001 101
     110 1
00000000 10111001 101
         1101
00000000 01101001 101
          1101
00000000 00000001 101
                1 101
00000000 00000000 000
If the data has not been changed, the result of this operation should always result in 0. If a bit has been changed, the result will not be 0, as shown here:
10110011 00111000 101
1101
01100011 00111000 101
 1101
00001011 00111000 101
    1101
00000110 00111000 101
     110 1
00000000 10111000 101
         1101
00000000 01101000 101
          1101
00000000 00000000 101
The CRC might seem like a complex operation, but it plays to a computer’s strong points—finite length binary operations. If the length of the CRC is set the same as a standard small register in common processors, say eight bits, calculating the CRC is a fairly straightforward and quick process. CRC checks have the advantage of being resistant to multibit changes, unlike the parity check described previously.
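The long division above translates directly into code. This sketch implements the same 3-bit CRC, with the divisor 1101 standing for the polynomial x^3 + x^2 + 1 from the worked example.

```python
def crc_remainder(data_bits, divisor):
    """Append the padding bits, then do binary long division using XOR."""
    crc_len = len(divisor) - 1
    work = list(data_bits) + [0] * crc_len        # the added CRC bits
    for i in range(len(data_bits)):
        if work[i] == 1:                          # divide only where a 1 leads
            for j, d in enumerate(divisor):
                work[i + j] ^= d
    return work[-crc_len:]                        # the remainder is the CRC

def crc_check(received_bits, divisor):
    """Divide the whole received message; a zero remainder means no error."""
    crc_len = len(divisor) - 1
    work = list(received_bits)
    for i in range(len(work) - crc_len):
        if work[i] == 1:
            for j, d in enumerate(divisor):
                work[i + j] ^= d
    return not any(work[-crc_len:])

divisor = [1, 1, 0, 1]                            # x^3 + x^2 + 1
data = [int(b) for b in "1011001100111001"]

crc = crc_remainder(data, divisor)
print(crc)                                        # [1, 0, 1], as in the example
print(crc_check(data + crc, divisor))             # True

corrupted = list(data)
corrupted[15] ^= 1                                # flip the last data bit
print(crc_check(corrupted + crc, divisor))        # False, error detected
```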
Detecting an error is only half of the problem, however. Once the error is detected, what should the transport system do? There are essentially three options.
The transport system can simply throw the data away. In this case, the transport is effectively transferring the responsibility of what to do about the error up to higher-level protocols or perhaps the application itself. As some applications may need a complete data set with no errors (think a file transfer system, or a financial transaction), they will likely have some way to discover any missing data and retransmit it. Applications that do not care about small amounts of missing data (think a voice stream) can simply ignore the missing data, reconstructing the information at the receiver as well as possible given the missing information.
The transport system can signal the transmitter that there is an error, and let the transmitter decide what to do with this information (generally the data in error will be retransmitted).
The transport system can go beyond throwing data away by including enough information in the original transmission to determine where the error is and attempt to correct it. This is called Forward Error Correction (FEC). Hamming codes, among the first FEC mechanisms developed, are also among the simplest to explain. The Hamming code is best explained by example; Table 2-2 will be used to illustrate.
Table 2-2 An Illustration of the Hamming Code

   | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   | 12
   | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 | 1000 | 1001 | 1010 | 1011 | 1100
   | P1   | P2   | D1   | P4   | D2   | D3   | D4   | P8   | D5   | D6   | D7   | D8
   |      |      | 1    |      | 0    | 1    | 1    |      | 0    | 0    | 1    | 1
P1 | 1    |      | X    |      | X    |      | X    |      | X    |      | X    |
P2 |      | 0    | X    |      |      | X    | X    |      |      | X    | X    |
P4 |      |      |      | 1    | X    | X    | X    |      |      |      |      | X
P8 |      |      |      |      |      |      |      | 0    | X    | X    | X    | X
In Table 2-2:
• Each bit position in the 12-bit space that is a power of two (1, 2, 4, and 8) is set aside as a parity bit.
• The 8-bit number to be protected with FEC, 10110011, has been distributed across the remaining bits in the 12-bit space.
• Each parity bit is set to 0, and then parity is calculated for each parity bit by adding the number of 1s in positions where the binary bit number has the same bit set as the parity bit. Specifically:
• P1 has the far-right bit set in its bit number; the other bits in the number space that also have the far right bit set are included in the parity calculation (see the second row in the table to find all the bit positions in the number with the far-right bit set). These are indicated in the table with an X in the P1 row. The total number of 1s is an odd number, 3, so the P1 bit is set to 1 (this example is using even parity).
• P2 has the second bit from the right set; the other bits in the number space that have the second from the right bit set are included in the parity calculation, as indicated with an X in the P2 row of the table. The total number of 1s is an even number, 4, so the P2 bit is set to 0.
• P4 has the third bit from the right set, so the other bits in the number space that have the third bit from the right set in their position numbers are included in the parity calculation, as indicated with an X in the P4 row of the table. There are an odd number of 1s in the marked columns, so the P4 parity bit is set to 1.
To determine if any information has changed, the receiver can check the parity bits in the same way the sender has calculated them; the total number of 1s in any set should be an even number, including the parity bit. If one of the data bits has been flipped, the receiver should never find a single parity error, because each of the bit positions in the data is covered by multiple parity bits. To discover which data bit is incorrect, the receiver adds the positions of the parity bits that are in error; the result is the bit position that has been flipped. For instance, if the bit in position 9, which is the fifth data bit, is flipped, then parity bits P1 and P8 would be both in error. In this case, 8 + 1 = 9, so the bit in position 9 is in error and flipping it would correct the data. If a single parity bit is in error—for example, P1 or P8—then it is that parity bit which has been flipped, and the data itself is correct.
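The encoding and correction process just described can be sketched in Python, following the 12-bit layout of Table 2-2 (a minimal illustration, not a production FEC implementation):

```python
# Positions 1, 2, 4, and 8 hold parity bits; the rest hold data bits.
PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)

def hamming_encode(data_bits):
    """Place 8 data bits into a 12-bit codeword and compute even parity."""
    bits = [0] * 13  # index 0 unused, so indexes match the table's positions
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        bits[pos] = bit
    for p in PARITY_POSITIONS:
        # Even parity over every other position whose binary number has bit p set
        covered = sum(bits[i] for i in range(1, 13) if i & p and i != p)
        bits[p] = covered % 2
    return bits[1:]

def hamming_correct(codeword):
    """Return (corrected codeword, syndrome); syndrome 0 means no error found."""
    bits = [0] + list(codeword)
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(bits[i] for i in range(1, 13) if i & p) % 2:
            syndrome += p  # this parity check failed
    if syndrome:
        bits[syndrome] ^= 1  # the failed checks sum to the flipped position
    return bits[1:], syndrome

codeword = hamming_encode([1, 0, 1, 1, 0, 0, 1, 1])  # the data from Table 2-2
corrupted = list(codeword)
corrupted[8] ^= 1  # flip position 9, the fifth data bit
fixed, syndrome = hamming_correct(corrupted)
print(syndrome)  # → 9, so flipping position 9 back corrects the data
```

Running the sketch against the worked example, the syndrome comes out to 9 exactly as in the text: parity checks P1 and P8 fail, and 8 + 1 = 9.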
While the Hamming code is ingenious, there are many bit flip patterns it cannot detect. A more modern code, such as Reed-Solomon, can detect and correct a wider range of error conditions while adding less additional information to the data stream.
Note
There are a large number of different kinds of CRC and error correction codes used throughout the communications world. CRC checks are classified by the number of bits used in the check (the number of bits of padding, or rather the length of the polynomial), and, in some cases, the specific application. For instance, the Universal Serial Bus uses a 5-bit CRC (CRC-5-USB); the Global System for Mobile Communications (GSM), a widely used cellular telephone standard, uses CRC-3-GSM; Code Division Multi-Access (CDMA), another widely used cellular telephone standard, uses CRC-6-CDMA2000A, CRC-6-CDMA2000B, and CRC-30; and some car area networks (CANs), used to tie together various components in a vehicle, use CRC-17-CAN and CRC-21-CAN. Some of these various CRC functions are not a single function, but rather a class, or family, of functions, with many different codes and options within them.
You walk into a room and shout, “Joe!” Your friend, Joe, turns around and begins a conversation on politics and religion (the two forbidden topics, of course, in any polite conversation). This ability to use a single medium (the air through which your voice travels) to address one person, even though many other people are using the same medium for other conversations at the same time, is what is called, in network engineering, multiplexing. More formally:
Multiplexing is used to allow multiple entities attached to the network to communicate over a shared network.
Why is the word entities used here instead of hosts? Returning to the “conversation with Joe” example, imagine the one way you can communicate with Joe is through his teenaged child, who only texts (never talks). In fact, Joe is part of a family of several hundred to several thousand people, and all the communications for this entire family must come through this one teenager, and each person in the family has multiple conversations running concurrently, sometimes on different topics with the same person. The poor teenager must text very quickly, and keep a lot of information in her head, like “Joe is having four conversations with Mary,” and must keep the information in each conversation completely separate from the other. This is closer to how network multiplexing really works; consider:
• There could be millions (or billions) of hosts connected to a single network, all sharing the same physical network to communicate with one another.
• Each of these hosts actually contains many applications, possibly several hundred, each of which can communicate with any of the hundreds of applications on any other host connected to the network.
• Each of these applications may, in fact, have several conversations to any other application running on any other host in the network.
If this is starting to sound complicated, that is because it is. The question this section needs to answer, then, is this:
How do hosts multiplex effectively over a computer network?
The following sections consider the most commonly used solutions in this space, as well as some interesting problems tied up in this basic problem, such as multicast and anycast.
Computer networks use a series of hierarchically arranged addresses to solve these problems; Figure 2-5 illustrates.
In Figure 2-5, there are four levels of addressing shown:
• At the physical link level, there are interface addresses that allow two devices to address a particular device individually.
• At the host level, there are host addresses that allow two hosts to address a particular host directly.
• At the process level, there are port numbers that, combined with the host address, allow two processes to address a particular process on a particular device.
• At the conversation level, the set of source port, destination port, source address, and destination address can be combined to uniquely identify a particular conversation, or flow.
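As a sketch of how a receiving host might keep conversations separate (the addresses, ports, and payloads here are made up for illustration), the four-part conversation identifier can serve directly as a dictionary key:

```python
# A conversation (flow) is identified by the 4-tuple of addresses and ports;
# a host can use that tuple to demultiplex packets to the right conversation.
conversations = {}

def demux(src_addr, dst_addr, src_port, dst_port, payload):
    key = (src_addr, dst_addr, src_port, dst_port)
    conversations.setdefault(key, []).append(payload)

# Two conversations between the same pair of hosts stay separate
# because their source ports differ.
demux("192.0.2.1", "192.0.2.2", 40000, 80, "request A")
demux("192.0.2.1", "192.0.2.2", 40001, 80, "request B")
print(len(conversations))  # → 2
```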
This diagram and explanation appear very clean. In real life, things are much messier. In the most widely deployed addressing scheme, the Internet Protocol (IP), there are no host-level addresses. Instead, there are logical and physical addresses on a per interface basis.
Note
IP and IP addressing will be considered in more detail in Chapter 5, “Higher Layer Data Transports.”
Multiplexing and multiplexing identifiers (addresses) are stacked hierarchically on top of one another in a network.
Note
Mechanisms that associate one kind of address with another between some layers will be considered more fully in Chapter 6, “Interlayer Discovery.”
There are some situations, however, in which you want to send traffic to more than one host at a time; for these situations, there are multicast and anycast. These two special kinds of addressing will be considered in the following sections.
Note
This short explanation cannot really do justice to the entire scope of solutions available to build multicast trees; see the “Further Reading” section at the end of the chapter for more material to consider in this area.
If you have a network like the one shown in Figure 2-6, and you need A to distribute the same content to G, H, M, and N, how would you go about doing this?
You could either generate four copies of the traffic, sending one stream to each of the receivers using normal (unicast) forwarding, or you could somehow send the traffic to a single address that the network knows to replicate so all four hosts receive a copy. This latter option is called multicast, which means using a single address to transmit traffic to multiple receivers. The key problem to solve in multicast is to forward and replicate traffic as it passes through the network so each receiver who is interested in the stream will receive a copy.
Note
The set of devices interested in receiving a stream of packets from a multicast source is called a multicast group. This can be a bit confusing because the address used to describe the multicast stream is also called a multicast group in some situations. The two uses are practically interchangeable in that the set of devices interested in receiving a particular set of multicast packets will join the multicast group, which, in effect, means listening to a particular multicast address.
Note
In cases where the multicast traffic is bidirectional, this problem is much more difficult to solve. For instance, assume there is a requirement to build a multicast group with every host in the network shown in Figure 2-6 except N, and further that any multicast transmitted to the multicast group’s address be delivered to every host within the multicast group.
The key problem for multicast to solve can be broken into two problems:
• How do you discover which devices would like to receive a copy of traffic transmitted to the multicast group?
• How do you determine which devices in the network should replicate the traffic, and on which interfaces they should send copies?
One possible solution is to use local requests to build a tree through which the multicast traffic should be forwarded through the network. An example of such a system is Sparse Mode in Protocol Independent Multicast (PIM). In this process, each device sends a join message for the multicast streams it is interested in; these joins are passed upstream in the network until the sender (the host sending packets through the multicast stream) is reached. Figure 2-7 is used to illustrate this process.
In Figure 2-7:
1. A is sending some traffic to a multicast group (address); call it Z.
2. N would like to receive a copy of Z, so it sends a request (a join) to its upstream router, D, for a copy of this traffic.
3. D does not have a source for this traffic, so it sends a request to the routers it is connected to for a copy of this traffic; in this case, the only router D sends the request to is B.
At each hop, the router receiving the request will place the interface on which it received the request into its Outbound Interface List (OIL), and begin forwarding traffic received in the given multicast group received on any other interface. In this way, a path from the receiver to the originator of the traffic can be built; this is called a reverse path tree.
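The join processing described above can be sketched as a toy router that records, per group, the interfaces on which joins arrived (the class and interface names are illustrative, not PIM's actual state machine):

```python
class MulticastRouter:
    def __init__(self):
        self.oil = {}  # multicast group -> set of outbound interfaces

    def receive_join(self, group, interface):
        # The interface a join arrives on is added to the group's OIL.
        self.oil.setdefault(group, set()).add(interface)

    def forward(self, group, in_interface):
        # Replicate traffic for the group out every OIL interface
        # except the one the packet arrived on.
        return sorted(self.oil.get(group, set()) - {in_interface})

d = MulticastRouter()
d.receive_join("Z", "eth1")    # join from N
d.receive_join("Z", "eth2")    # join from another receiver
print(d.forward("Z", "eth0"))  # → ['eth1', 'eth2']
```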
A second option for discovering which hosts are interested in receiving traffic for a specific multicast group is through some sort of registration server. Each host that would like to receive a copy of the stream can register its desire with a server. There are several ways the host can discover the presence of the server, including
• Treating the multicast group address like a domain name, and looking up the address of the registration server by querying for the multicast group address
• Building and maintaining a list, or mapping, of groups to servers in a local table
• Using some form of hash algorithm to compute the registration server from the multicast group address
The registrations can either be tracked by devices on the path to the server, or, once the set of receivers and transmitters is known, the server can signal the appropriate devices along the path which ports should be configured for replicating and forwarding packets.
Another problem multiplexing solutions face is being able to address a specific instance of a service implemented on multiple hosts using a single address. Figure 2-8 illustrates.
In Figure 2-8, some service, S, needs to be designed to increase its performance. To accomplish this goal, a second copy of the service has been created, with the two copies being named S1 and S2. These two copies of the service are running on two servers, M and N. The problem anycast seeks to solve is this:
How can clients be directed to the most optimal instance of a service?
One way of solving this problem is to direct all the clients to a single device and have a load balancer split the traffic to the servers based on the topological location of the client, the load of each server, and other factors. This solution is not always ideal, however. For instance, what if the load balancer cannot handle all the connection requests generated by the clients who want to reach various copies of the service? What sorts of complexities are going to be added to the network to allow the load balancer to track the health of the various copies of the service?
Anycast solves this problem by assigning the same address to each copy of the service. In the network illustrated in Figure 2-8, then, M and N would use the same address to provide reachability to S1 and S2. M and N would have different addresses assigned and advertised to provide reachability to other services, and to the devices themselves, as well.
H and K, the first hop routers beyond M and N, would advertise this same address into the network. When C and D receive two routes to the same destination, they will choose the closest route in terms of metrics. In this case, if every link in the same network is configured with the same metric, then C would direct traffic sourced from A, and destined to the service’s address, toward M. D, on the other hand, will direct traffic sourced from B, and destined to the service’s address, toward N. What happens if two instances of the service are about the same distance apart? The router will choose one of the two paths using a local hash algorithm.
Note
See Chapter 7 for more information about equal cost multipath switching, and how using a hash ensures the same path is used for each packet in a flow. Routing is generally stable enough, even in the Internet, to use anycast solutions with stateful protocols.5
Anycast is often used for large-scale services that must scale by provisioning a lot of servers to support the single service. Examples include the following:
• Most large-scale Domain Name Service (DNS) system servers are actually a set of servers accessible through an anycast address.
• Many large-scale web-based services, particularly social media and search, where a single service is implemented on a large number of edge devices.
• Content caching services often use anycast in distributing and serving information.
Designed correctly, anycast can provide effective load balancing as well as optimal performance for services.
Do you remember your great aunt (or was it your second cousin once removed?) who talked so fast that you could not understand a word she was saying? Some computer programs talk too fast, too. Figure 2-9 illustrates.
In Figure 2-9:
• At Time 1 (T1), the sender is transmitting about four packets for every three the receiver can process. The receiver has a five-packet buffer to store unprocessed information; there are two packets in this buffer.
• At T2, the sender has transmitted four packets, and the receiver has processed three; the buffer at the receiver is now holding three packets.
• At T3, the sender has transmitted four packets, and the receiver has processed three; the buffer at the receiver is now holding four packets.
• At T4, the sender has transmitted four packets, and the receiver has processed three; the buffer at the receiver is now holding five packets.
The next packet transmitted will be dropped by the receiver because there is no space in the buffer to store it while the receiver is processing packets so they can be removed. What is needed is some sort of feedback loop to tell the transmitter to slow down the rate at which it is sending packets, as illustrated in Figure 2-10.
This kind of feedback loop requires either implicit signaling or explicit signaling between the receiver and the transmitter. Implicit signaling is more widely deployed. In implicit signaling, the transmitter assumes the packet has not been received based on some observation about the traffic stream. For instance, the receiver may acknowledge the receipt of some later packet, or the receiver may simply not acknowledge receiving a particular packet, or the receiver may not send anything for a long period of time (in network terms). In explicit signaling, the receiver somehow directly informs the sender that a specific packet has not been received.
Windowing, combined with implicit signaling, is by far the most widely deployed flow control mechanism in real networks. Windowing essentially consists of the following:
1. A transmitter sends some amount of information to the receiver.
2. The transmitter waits before deciding if the information has been correctly received or not.
3. If the receiver acknowledges receipt within a specific amount of time, the transmitter sends new information.
4. If the receiver does not acknowledge receipt within a specific amount of time, the transmitter resends the information.
Implicit signaling is normally used with windowing protocols by simply not acknowledging the receipt of a particular packet. Explicit signaling is sometimes used when the receiver knows it has dropped a packet, when received data contains errors, data is received out of order, or data is otherwise corrupted in some way. Figure 2-11 illustrates the simplest windowing scheme, a single packet window.
In a single packet window (also sometimes called a ping pong), the transmitter sends a packet only when the receiver has acknowledged (shown as an ack in the illustration) the receipt of the last packet transmitted. If the packet is not received, the receiver will not acknowledge it. On sending a packet, the sender sets a timer, normally called the retransmit timer; once this timer wakes up (or expires), the sender will assume the receiver has not received the packet, and resend it.
How long should the sender wait? There are a number of possible answers to this question, but essentially the sender can either wait a fixed amount of time, or it can set a timer based on information inferred from previous transmissions and network conditions. A simple (and naïve) scheme would be to
• Measure the length of time between sending a packet and receiving an acknowledgment, called the Round Trip Time (RTT, though normally written in the lowercase, so rtt).
• Set the retransmit timer to this number plus some small amount of buffer time to account for any variability in the rtt over multiple transmissions.
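The naïve scheme above can be sketched as follows (the 50 ms slack value is an arbitrary illustrative choice, not a recommendation):

```python
import time

def naive_retransmit_timer(rtt_ms, slack_ms=50):
    # The measured rtt plus a small fixed buffer to absorb variability
    # in the rtt across multiple transmissions.
    return rtt_ms + slack_ms

# Measure rtt around a send/ack exchange (simulated here with a short sleep).
start = time.monotonic()
time.sleep(0.01)  # stand-in for "send packet, wait for the ack"
rtt_ms = (time.monotonic() - start) * 1000

timer = naive_retransmit_timer(rtt_ms)
assert timer > rtt_ms
print(naive_retransmit_timer(200))  # → 250
```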
Note
More information about various ways to calculate the retransmit timer is considered in Chapter 5.
It is also possible for the receiver to receive two copies of the same information:
1. A transmits a packet and sets its retransmit timer.
2. B receives the packet, but
a. Is not able to acknowledge receipt because it is out of memory or is experiencing high processor utilization or some other condition.
b. Sends an acknowledgment, but the acknowledgment is dropped by a network device.
3. The retransmit timer at A times out, so the sender transmits another copy of the packet.
4. B receives this second copy of the same information.
How can the receiver detect duplicated data? It does seem possible for the receiver to compare the packets received to see if there is duplicate information, but this will not always work—perhaps the sender intended to send the same information twice. The usual method of detecting duplicate information is by including some sort of sequence number in transmitted packets. Each packet is given a unique sequence number while being built by the sender; if the receiver receives two packets with the same sequence number, it assumes the data is duplicated and discards the copies.
A window size of 1, or a ping pong, requires one round trip between the sender and the receiver for each set of data transmitted. This would generally result in a very slow transmission rate. If you think of the network as the end-to-end railroad track, and each packet as a single train car, the most efficient use of the track, and the fastest transmission speed, is going to be when the track is always full. This is not physically possible, however, in the case of a network because the network is used by many sets of senders and receivers, and there are always network conditions that will prevent the network utilization from reaching 100%. There is some balance between the increased efficiency and speed of sending more than one packet at a time, and the multiplexing and “safety” of sending fewer packets at a time (such as one). If a correct balance point can be calculated in some way, a fixed window flow control scheme may work well. Figure 2-12 illustrates.
In Figure 2-12, assuming a three-packet fixed window:
• At T1, T2, and T3, A transmits packets; A does not need to wait for B to acknowledge anything to send these three packets, as the window size is fixed at 3.
• At T4, B acknowledges these three packets, which allows A to transmit another packet.
• At T5, B acknowledges this new packet, even though it is only one packet. B does not need to wait until A has transmitted three more packets to acknowledge a single packet. This acknowledgment allows A to have enough budget to send three more packets.
• At T5, T6, and T7, A sends three more packets, filling its window. It must now wait until B acknowledges these three packets to send more information.
• At T8, B acknowledges the receipt of these three packets.
In windowing schemes where the window size is more than one, there are four kinds of acknowledgments a receiver can send to the transmitter:
• Positive acknowledgment: The receiver acknowledges the receipt of each packet individually. For instance, if sequence numbers 1, 3, 4, and 5 have been received, the receiver will acknowledge receiving those specific packets. The transmitter can infer which packets the receiver has not received by noting which sequence numbers have not been acknowledged.
• Negative acknowledgment: The receiver sends a negative acknowledgment for packets it infers are missing, or were corrupted when received. For instance, if sequence numbers 1, 3, 4, and 5 have been received, the receiver may infer that sequence number 2 is missing and send a negative acknowledgment for this packet.
• Selective acknowledgment: This essentially combines positive and negative acknowledgment, as above; the receiver sends both positive and negative acknowledgments for each sequence of received information.
• Cumulative acknowledgment: Acknowledgment of the receipt of a sequence number implies receipt of all information with lower sequence numbers. For instance, if sequence number 10 is acknowledged, the information contained in sequence numbers 1–9 is implied, as well as the information contained in sequence number 10.
A third windowing mechanism is called sliding window flow control. This mechanism is very similar to a fixed window flow control mechanism, except the size of the window is not fixed. In sliding window flow control, the transmitter can dynamically modify the size of the window as network conditions change. The receiver does not know what size the window is, only that the sender transmits packets, and, from time to time, the receiver acknowledges some or all of them using one of the acknowledgment mechanisms described in the preceding list.
Sliding window mechanisms add one more interesting question to the questions already considered in other windowing mechanisms: What size should the window be? A naïve solution might just calculate the rtt and set the window size to some multiple of the rtt. More complex solutions have been proposed; some of these will be considered in Chapter 5, in the discussion of the Transmission Control Protocol (TCP).
Another solution, more often used in circuit switched rather than packet switched networks, is for the sender, receiver, and network to negotiate a bit rate for any particular flow. A wide array of possible bit rates have been designed for a number of different networking technologies; perhaps the “most complete set” is for Asynchronous Transfer Mode (ATM)—look for ATM networks in your nearest networking history museum, because ATM is rarely deployed in production networks any longer. The ATM bit rates are:
•恒定比特率(CBR):发送方将以恒定速率传输数据包(或信息)。因此,网络可以围绕这个恒定的带宽负载进行规划,并且接收器可以围绕这个恒定的比特率进行规划。该比特率通常用于需要发送器和接收器之间时间同步的应用。
• Constant Bit Rate (CBR): The sender will be transmitting packets (or information) at a constant rate; hence, the network can plan around this constant bandwidth load, and the receiver can plan around this constant bit rate. This bit rate is normally used for applications requiring time synchronization between the sender and receiver.
• Variable Bit Rate (VBR): The sender will be transmitting traffic at a variable rate. This rate is normally negotiated with several other pieces of information about the flow that help the network and the receiver plan resources, including:
• The peak rate, or the maximum packets per second the sender plans to transmit
• The sustained rate, or the rate at which the sender plans to transmit normally
• The maximum burst size, or the largest number of packets the sender intends to transmit over a very short period of time
• Available Bit Rate (ABR): The sender intends to rely on the capability of the network to deliver traffic on a best-effort basis, using some other form of flow control, such as a sliding window technique, to prevent buffer overflows and adjust transmitted traffic to the available bandwidth.
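The VBR parameters above can be made concrete with a small conformance check. This is a hypothetical helper for illustration, not an ATM implementation; it tests whether a flow, described as a list of per-second packet counts, honors a negotiated peak rate, sustained rate, and maximum burst size.

```python
# Illustrative VBR-style traffic contract check (not an ATM implementation).
# per_second_counts: packets transmitted in each one-second interval.

def conforms(per_second_counts, peak_rate, sustained_rate, max_burst):
    """Return True if the flow stays within its negotiated parameters."""
    total = 0
    excess = 0  # running burst above the sustained rate
    for count in per_second_counts:
        if count > peak_rate:
            return False          # peak rate violated
        excess = max(0, excess + count - sustained_rate)
        if excess > max_burst:
            return False          # burst above the sustained rate too large
        total += count
    # Long-run average must not exceed the sustained rate.
    return total <= sustained_rate * len(per_second_counts)
```

A network element performing this kind of check could then plan resources around the sustained rate while reserving headroom for bursts up to `max_burst`.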
This chapter begins with the fundamentals of understanding the entire scope of the network engineering problem space: transporting data across the network. Four specific problems were uncovered by considering the human language space, and several solutions were presented at a high level:
• To marshal the data, fixed length and TLV-based systems were considered, as well as the concepts of metadata, dictionaries, and grammars.
• To manage errors, two methods were considered to detect errors, parity checks and the CRC; and one method was considered for error correction, the Hamming Code.
• To allow multiple senders and receivers to use the same physical media, several concepts in multiplexing were considered, including multicast and anycast.
• To prevent buffer overflows, several kinds of windowing were explored, and negotiated bit rates defined.
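Of the error-detection methods recapped above, the parity check is simple enough to show in full. The sketch below implements even parity: the sender appends one bit so the total count of 1s is even, and the receiver recomputes the count to detect a single bit flip.

```python
# Even parity, the simplest error-detection mechanism: one extra bit
# makes the total number of 1s even.

def add_parity(bits):
    """Append an even-parity bit to a list of 0s and 1s."""
    return bits + [sum(bits) % 2]

def check_parity(bits):
    """True if the received bits (including the parity bit) have even parity."""
    return sum(bits) % 2 == 0
```

Flipping any single bit changes the parity and is detected; flipping two bits restores even parity and goes unnoticed, which is why stronger mechanisms such as the CRC exist.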
Like many other areas you will encounter in this book, the world of transport can become an entire specialty. Understanding the basics, however, is important for every network engineer. The next chapter will consider some models that will help you to put data transport, which is generally associated with forwarding, or the data plane, into a larger context. Chapters 4 and 5 will consider several different examples of transport protocols, pulling the concepts in this chapter and the next into real-life examples.
Some of these further reading resources are provided to help in answering the study questions for this chapter.
Conran, Matt. “Know Anycast? Think Before You Talk.” Blog. CacheFly, February 22, 2017. https://insights.cachefly.com/anycast-think-before-you-talk-part-i.
———. “Anycast—Think Before You Talk.” Blog. CacheFly, February 22, 2017. https://insights.cachefly.com/anycast-think-before-you-talk-part-ii.
“Flag Day.” Accessed June 6, 2017. http://www.catb.org/jargon/html/F/flag-day.html.
Gleick, James. The Information: A History, A Theory, A Flood. New York: Vintage, 2012.
“Grpc/.” Accessed June 7, 2017. http://www.grpc.io/docs/tutorials/basic/c.html.
Internet Protocol. Request for Comments 791. RFC Editor, 1981. doi:10.17487/RFC0791.
Koopman, P. “32-Bit Cyclic Redundancy Codes for Internet Applications.” In Proceedings International Conference on Dependable Systems and Networks, 459–68, 2002. doi:10.1109/DSN.2002.1028931.
Koopman, P., and T. Chakravarty. “Cyclic Redundancy Code (CRC) Polynomial Selection for Embedded Networks.” In International Conference on Dependable Systems and Networks, 2004, 145–54, 2004. doi:10.1109/DSN.2004.1311885.
Loveless, Josh, Ray Blair, and Arvind Durai. IP Multicast, Volume I: Cisco IP Multicast Networking. 1st edition. Indianapolis, IN: Cisco Press, 2016.
———. IP Multicast, Volume II: Advanced Multicast Concepts and Large-Scale Multicast Design. 1st edition. Indianapolis, IN: Cisco Press, 2017.
McPherson, Danny R., David Oran, Dave Thaler, and Eric Osterweil. Architectural Considerations of IP Anycast. Request for Comments 7094. RFC Editor, 2014. doi:10.17487/RFC7094.
Moon, Todd K. Error Correction Coding: Mathematical Methods and Algorithms. 1st edition. Hoboken, NJ: Wiley-Interscience, 2005.
Morelos-Zaragoza, Robert H. The Art of Error Correcting Coding. 2nd edition. Chichester; Hoboken, NJ: Wiley, 2006.
Moy, John. “OSPF Specification.” Request for Comment. RFC Editor, October 1989. doi:10.17487/RFC1131.
———. “OSPF Version 2.” Request for Comment. RFC Editor, April 1998. doi:10.17487/RFC2328.
Palsson, Bret, Prashanth Kumar, Samir Jafferali, and Zaid Ali Kahn. “TCP over IP Anycast—Pipe Dream or Reality?” Blog. LinkedIn Engineering Blog, September 2015. https://engineering.linkedin.com/network-performance/tcp-over-ip-anycast-pipe-dream-or-reality.
Postel, J. NCP/TCP Transition Plan. Request for Comments 801. RFC Editor, 1981. doi:10.17487/RFC0801.
Shannon, Claude E., and Warren Weaver. The Mathematical Theory of Communication. 4th edition. Champaign, IL: University of Illinois Press, 1949.
Soni, Jimmy, and Rob Goodman. A Mind at Play: How Claude Shannon Invented the Information Age. New York: Simon & Schuster, 2017.
Stone, James V. Information Theory: A Tutorial Introduction. 1st edition. England: Sebtel Press, 2015.
“Understanding the CBR Service Category for ATM VCs.” Cisco. Accessed June 10, 2017. http://www.cisco.com/c/en/us/support/docs/asynchronous-transfer-mode-atm/atm-traffic-management/10422-cbr.html.
Warren, Henry S. Hacker’s Delight. 2nd edition. Upper Saddle River, NJ: Addison-Wesley Professional, 2012.
Williamson, Beau. Developing IP Multicast Networks, Volume I. Indianapolis, IN: Cisco Press, 1999.
1. While TLVs almost always require more space to carry a piece of information than a fixed length field, there are some cases where the fixed length field will be less efficient. Carrying IPv6 addresses is one specific instance of a TLV being more efficient than a fixed length field. Describe why this is. Comparing the way routing protocols carry IPv4 and IPv6 addresses is a good place to start in understanding the answer. In particular, examine the way IPv4 addresses are carried in OSPF version 2, and compare this with the way these same addresses are carried in BGP.
2. Consider the following data types and determine whether you would use a fixed length field or a TLV to carry each one, and why.
a. The time and date
b. A person’s full name
d. The square footage of a building
e. A series of audio or video clips
f. A book broken down into sections such as paragraphs and chapters
g. The city and state in an address
h. The house number or postal code in an address
3. What is the relationship between the bit error rate (BER) and the amount of information required to detect and/or repair errors in a data transmission stream? Can you explain why this might be?
4. Under some conditions, it makes more sense to send enough information to correct data on receipt (such as using a Hamming code). In others, it makes more sense to discover the error and throw the data away. These conditions would not be just the link type, however, or just the application; they would be a combination of the two. What link characteristics, combined with what kinds of application characteristics, would suggest the use of FEC? Which ones would suggest the use of error detection combined with retransmitting the data? It might be best to think of specific applications and specific link types first, and then generalize from there.
5. How many bit flip changes can a parity check detect?
6. Implicit and explicit signaling have different characteristics, or rather different tradeoffs. Describe at least one positive and one negative aspect of each form of signaling for error detection and/or correction.
7. In a large-scale deployment of anycast, it is possible for packets from a single stream to be delivered to multiple receivers. There are two broad solutions to this problem; the first is for receivers to force the sender to reset their state if a packet appears to be misdelivered in this way. Another is to constrict the interface between the sender and receiver in a way that allows state to be contained to a single transaction. One form of this latter solution is called atomic transactions, and is often implemented in RESTful interfaces. Consider these two possible solutions, and describe the kinds of applications, giving specific examples of applications, that might be better suited for each of these two solutions.
8. Would you always consider the dictionary and the grammar forms of meta-data? Why or why not?
9. Find three other kinds of metadata that do not involve the way the data is formatted, but rather describe the data in a way that might be useful to an attacker trying to understand a specific process, such as transferring funds between two accounts. Is there a specific limit to what might be considered metadata, or is it more accurate to say “metadata is in the eye of the beholder”?
10. Consider the negotiated bit rates explained toward the end of the chapter. Is it possible to truly provide a constant bit rate in a packet switched network? Does your answer depend on the network conditions? If so, what conditions would impact the answer to the question?
1. “Flag Day.”
2. Postel, NCP/TCP Transition Plan.
3. Internet Protocol. This quote, or something similar, is attributed to Jon Postel.
4. Moy, OSPF Version 2, 201.
5. Palsson et al., “TCP over IP Anycast—Pipe Dream or Reality?”
The set of problems and solutions considered in the preceding chapter provides some insight into the complexity of network transport systems. How can engineers engage with the apparent complexity involved in such systems?
The first way is to look at the basic problems transport systems solve, and understand the range of solutions available for each of those problems. The second is to build models that will aid in the understanding of transport protocols by
• Helping engineers classify transport protocols by their purpose, the information each protocol contains, and the interfaces between protocols
• Helping engineers know which questions to ask in order to understand a particular protocol, or to understand how a particular protocol interacts with the network over which it runs, and the applications that it carries information for
• Helping engineers understand how single protocols fit together to make a transport system
Chapter 1, “Fundamental Concepts,” provided a high-level overview of the transport problem and solution spaces. This chapter will tackle the second way in which engineers can understand protocols more fully: models. Models are essentially abstract representations of the problems and solutions considered in the previous chapter; they provide a more visual and module-focused representation, showing how things fit together. This chapter will consider this question:
How can transport systems be modeled in a way that allows engineers to quickly and fully grasp the problems these systems need to solve, as well as the way multiple protocols can be put together to solve them?
Three specific models will be considered in this chapter:
• The United States Department of Defense (DoD) model
• The Open Systems Interconnect (OSI) model
• The Recursive Internet Architecture (RINA) model
Each of these three models has a different purpose and history. A second form of protocol classification, connection oriented versus connectionless, will also be considered in this chapter.
In the 1960s, the US Defense Advanced Research Projects Agency (DARPA) sponsored the development of a packet switched network to replace the telephone network as a primary means of computer communications. Contrary to the myth, the original idea was not to survive a nuclear blast, but rather to create a way for the various computers then being used at several universities, research institutes, and government offices to communicate with one another. At the time, each computer system used its own physical wiring, protocols, and other systems; there was no way to interconnect these devices in order to even transfer data files, much less create anything like the “world wide web,” or cross-execute software. These original models were often designed to provide terminal-to-host communications, so you could install a remote terminal into an office or shared space, which could then be used to access the shared resources of the system, or host. Much of the original writing around these models reflects this reality.
One of the earliest developments in this area was the DoD model, shown in Figure 3-1.
The DoD model separated the job of transporting information across a network into four distinct functions, each of which could be performed by one of many protocols. The idea of having multiple protocols at each layer was considered somewhat controversial until the late 1980s, and even into the early 1990s. In fact, one of the key differences between the DoD and the original incarnation of the OSI model is the concept of having multiple protocols at each layer.
In the DoD model:
• The physical layer is responsible for getting the 0s and 1s modulated, or serialized, onto the physical link. Each link type has a different format for signaling a 0 or a 1; the physical layer is responsible for translating 0s and 1s into physical signals.
• The internet layer is responsible for transporting data between systems that are not connected through a single physical link. The internet layer, then, provides networkwide addresses, rather than link local addresses, and also provides some means for discovering the set of devices and links that must be crossed to reach these destinations.
• The transport layer is responsible for building and maintaining sessions between communicating devices and providing a common transparent data transmission mechanism for streams or blocks of data. Flow control and reliable transport may also be implemented in this layer, as in the case of TCP.
• The application layer is the interface between the user and the network resources, or specific applications that use and provide data to other devices attached to the network.
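The layer-by-layer division of labor above can be pictured as encapsulation: each layer wraps the data handed down from the layer above with its own header before passing it on. The sketch below is a toy illustration with invented header fields, not any real protocol format.

```python
# Toy encapsulation through the DoD layers; all header fields here are
# invented for illustration, and real protocols define binary formats.

def encapsulate(app_data, src, dst, port):
    """Wrap application data for transmission, layer by layer."""
    segment = {"port": port, "payload": app_data}           # transport layer
    packet = {"src": src, "dst": dst, "payload": segment}   # internet layer
    return packet          # the physical layer would serialize this to bits

def decapsulate(packet):
    """Unwrap a received packet back into application data."""
    return packet["payload"]["payload"]
```

The receiving host simply reverses the process, with each layer stripping the header meant for it and handing the payload up.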
The application layer, in particular, seems out of place in a model of network transport. Why should the application using the data be considered part of the transport system? Because early systems considered the human user the ultimate user of the data, and the application as primarily a way to munge data to be presented to the actual user. Much of the machine-to-machine processing, heavy processing of data before it is presented to a user, and simple storage of information in digital format were not even considered viable use cases. As information was being transferred from one person to another, the application was just considered a part of the transport system.
Two other points might help the inclusion of the application make more sense. First, in the design of these original systems, there were two components: a terminal and a host. The terminal was really a display device; the application lived on the host. Second, the networking software was not thought of as a separate “thing” in the system; routers had not yet been invented, nor any other separate device to process and forward packets. Rather, a host was just connected to either a terminal or another host; the network software was just another application running on these devices.
Over time, as the OSI model came into more regular use, the DoD model was modified to include more layers. For instance, in Figure 3-2, a diagram replicated from a 1983 paper on the DoD model, there are seven layers (seven being a magic number for some reason).1
Here three layers have been added:
• The utility layer is a set of protocols living between the more generic transport layer and applications. Specifically, the Simple Mail Transfer Protocol (SMTP), File Transfer Protocol (FTP), and other protocols were seen as being a part of this layer.
• The network layer from the four-layer version has been divided into the network layer and the internetwork layer. The network layer represents the differing packet formats used on each link type, such as radio networks and Ethernet (still very new in the early 1980s). The internetwork layer unifies the view of the applications and utility protocols running on the network into a single internet datagram service.
• The link layer has been inserted to differentiate between the encoding of information onto the various link types and a device’s connection to the physical link. Not all hardware interfaces provided a link layer.
Over time, these expanded DoD models fell out of favor; the four-layer model is the one most often referenced today. There are several reasons for this:
• The utility and application layers are essentially duplicates of one another in most cases. FTP, for instance, multiplexes content on top of the Transmission Control Protocol (TCP), rather than as a separate protocol or layer in the stack. TCP and the User Datagram Protocol (UDP) eventually solidified as the two protocols in the transport layer, with everything else (generally) running on top of one of these two protocols.
• With the invention of devices primarily intended to forward packets (routers and switches), the separation between the network and internetwork layers was overcome by events. The original differentiation was primarily between lower-speed long haul (wide area) links and shorter-run local area links; routers generally took the burden of installing links into wide area networks out of the host, so the differentiation became less important.
• Some interface types simply do not have a way to separate signal encoding from the host interface, as was envisioned in the split between the link and physical layers. Hence these two layers are generally munged into a single “thing” in the DoD model.
The DoD model is historically important because
• It is one of the first attempts to codify network functionality into a model.
• It is the model on which the TCP/IP suite of protocols (on which the global Internet operates) was designed; the artifacts of this model are important in understanding many aspects of TCP/IP protocol design.
• It had the concept of multiple protocols at any particular layer in the model “built in.” This set the stage for the overall concept of narrowing the focus of any particular protocol, while allowing many different protocols to operate at once over the same network.
In the 1960s, carrying through to the 1980s, the primary form of communications was the switched circuit; a sender would ask a network element (a switch) to connect it to a particular receiver, the switch would complete the connection (if the receiver was not busy), and traffic would be transmitted over the resulting circuit. If this sounds like a traditional telephone system, this is because it is, in fact, based on the traditional network system (now called Plain Old Telephone Service [POTS]). Large telephone and computer companies were deeply invested in this model, and received a lot of revenue from systems designed around circuit switching techniques. As the DoD model (and its set of accompanying protocols and concepts) started to catch on with researchers, these incumbents decided to build a new standards organization that would, in turn, build an alternate system providing the “best of both worlds.” They would incorporate the best elements of packet switching, while retaining the best elements of circuit switching, creating a new standard that would satisfy everyone. In 1977, this new standards organization was proposed, and adopted, as part of the International Organization for Standardization (ISO).
This new ISO working group designed a layered model similar to the proposed (and rejected) packet-based model, grounded in database communications. The primary goal was to allow intercommunication between the large database-focused systems dominant in the late 1970s. The committee was divided between telecom engineers and the database contingent, making the standards complex. The protocols developed needed to provide for both connection-oriented and connectionless session control, and invent the entire application suite to create email, file transfer, and many other applications (remember, applications are part of the stack). For instance, various transport modes needed to be codified to carry a wide array of services. In 1989—a full ten years later—the specifications were still not completely done. The protocol had not reached widespread deployment, even though many governments, large computer manufacturers, and telecom companies supported it over the DoD protocol stack and model.
But during the ten years the DoD stack continued to develop; the Internet Engineering Task Force (IETF) was formed to shepherd the TCP/IP protocol stack, primarily for researchers and universities (the Internet, as it was then known, did not allow commercial traffic, and would not until 1992). With the failure of the OSI protocols to materialize, many commercial networks, and networking equipment, turned to the TCP/IP protocol suite to solve real-world problems “right now.”
Further, because the development of the TCP/IP protocol stack was being paid for under grants by the U.S. government, the specifications were free. There were, in fact, TCP/IP implementations written for a wide range of systems available because of the work of universities and graduate students who needed the implementations for their research efforts. The OSI specifications, however, could only be purchased in paper form from the ISO itself, and only by members of the ISO. The ISO was designed to be a “members only” club, meant to keep the incumbents firmly in control of the development of packet switching technology. The “members only” nature of the organization, however, worked against the incumbents, eventually playing a role in their decline.
The OSI model, however, made many contributions to the advancement of networking; for instance, the careful attention paid to Quality of Service (QoS) and routing issues paid dividends in the years after. One major contribution was the concept of clear modularity; the complexity of interconnecting many different systems, with many different requirements, drove the OSI community to call for clear lines of responsibility, and well-defined interfaces between the layers.
A second was the concept of machine-to-machine communication. Middle boxes, then called gateways, now called routers and switches, were explicitly considered part of the networking model, as shown in Figure 3-3.
You probably do not even need to see this image to remember the OSI model—everyone who’s ever been through a networking class, or studied for a network engineering certification, is familiar with using the seven-layer model to describe the way networks work.
The genius of modeling a network in this way is it makes the interactions between the various pieces much easier to see and understand. Each pair of layers, moving vertically through the model, interacts through a socket, or Application Programming Interface (API). So to connect to a particular physical port, a piece of code at the data link layer would connect to the socket for that port. This allows the interaction between the various layers to be abstracted and standardized. A piece of software at the network layer does not need to know how to deal with various sorts of physical interfaces, only how to get data to the data link layer software on the same system.
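This abstraction is visible in any modern socket API. In the minimal sketch below, using Python's standard library, the application hands bytes to a socket and reads them from its peer without ever touching framing, addressing, or signaling; a `socketpair` keeps the example self-contained on one machine.

```python
# Application code talks only to the socket API; everything below the
# socket (framing, addressing, signaling) is handled by lower layers.
import socket

a, b = socket.socketpair()                  # two connected sockets
a.sendall(b"hello from the layer above")    # hand bytes down through the API
received = b.recv(64)                       # the peer reads them back
a.close()
b.close()
```

Swapping the underlying transport or physical medium would leave this application code unchanged, which is precisely the point of standardizing the interface between layers.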
Each layer has a specific set of functions to perform.
The physical layer, also called layer 1, is responsible for getting the 0s and 1s modulated, or serialized, onto the physical link. Each link type will have a different format for signaling a 0 or 1; the physical layer is responsible for translating 0s and 1s into these physical signals.
The data link layer, also called layer 2, is responsible for making certain transmitted information is actually sent to the right computer connected to the same link. Each device has a different data link (layer 2) address that can be used to send traffic to a specific device. The data link layer assumes each frame within a flow of information is separate from all other frames within the same flow, and only provides communication for devices connected through a single physical link.
The network layer, also called layer 3, is responsible for transporting data between systems not connected through a single physical link. The network layer, then, provides networkwide (or layer 3) addresses, rather than link local addresses, and also provides some means for discovering the set of devices and links that must be crossed to reach these destinations.
The transport layer, also called layer 4, is responsible for the transparent transfer of data between different devices. Transport layer protocols can be either "reliable," which means the transport layer will retransmit data lost at some lower layer, or "unreliable," which means data lost at lower layers must be retransmitted by some higher layer application.
The session layer, also called layer 5, does not really transport data, but rather manages the connections between applications running on two different computers. The session layer makes certain the type of data, the form of the data, and the reliability of the data stream are all exposed and accounted for.
The presentation layer, also called layer 6, actually formats data in a way to allow the application running on the two devices to understand and process the data. Encryption, flow control, and any other manipulation of data required to provide an interface between the application and the network happen here. Applications interact with the presentation layer through sockets.
The application layer, also called layer 7, provides the interface between the user and the application, which in turn interacts with the network through the presentation layer.
Not only can the interaction between the layers be described in precise terms within the seven-layer model, the interaction between parallel layers on multiple computers can be described precisely. The physical layer on the first device can be said to communicate with the physical layer on the second device, the data link layer on the first device with the data link layer on the second device, and so on. Just as interactions between two layers on a device are handled through sockets, interactions between parallel layers on different devices are handled through network protocols.
Ethernet describes the signaling of 0s and 1s onto a physical piece of wire, a format for starting and stopping a frame of data, and a means of addressing a single device among all the devices connected to a single wire. Ethernet, then, falls within both the physical and data link layers (1 and 2) in the OSI model.
IP describes the formatting of data into packets, and the addressing and other means necessary to send packets across multiple data link layer links to reach a device several hops away. IP, then, falls within the network layer (3) of the OSI model.
TCP describes session setup and maintenance, data retransmission, and interaction with applications. TCP, then, falls within the transport and session layers (4 and 5) of the OSI model.
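The mapping of Ethernet, IP, and TCP onto the layers can be seen directly in the way each protocol wraps the one above it. The sketch below is a toy encapsulation model: the dictionary fields and the addresses are simplified placeholders, not wire-accurate formats, but the nesting mirrors the layer assignments just described.

```python
# Toy encapsulation: application data wrapped by TCP, then IP, then Ethernet.
def tcp_segment(data, src_port, dst_port):
    return {"layer": "TCP (4/5)", "src": src_port, "dst": dst_port, "payload": data}

def ip_packet(segment, src_ip, dst_ip):
    return {"layer": "IP (3)", "src": src_ip, "dst": dst_ip, "payload": segment}

def ethernet_frame(packet, src_mac, dst_mac):
    return {"layer": "Ethernet (1/2)", "src": src_mac, "dst": dst_mac, "payload": packet}

frame = ethernet_frame(
    ip_packet(
        tcp_segment(b"GET /", 49152, 80),
        "192.0.2.1", "198.51.100.7"),
    "00:11:22:33:44:55", "66:77:88:99:aa:bb")

# Walking down through the payloads recovers each layer in order.
layers = []
unit = frame
while isinstance(unit, dict):
    layers.append(unit["layer"])
    unit = unit["payload"]
print(layers)  # ['Ethernet (1/2)', 'IP (3)', 'TCP (4/5)']
```

Each receiving layer strips its own header and hands the payload up, reversing the nesting shown here.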
One of the more confusing points for engineers who only ever encounter the TCP/IP protocol stack is the different way the protocols designed for the OSI stack interact with devices. In TCP/IP, addresses refer to interfaces (and, in a world of networks with a lot of virtualization, multiple addresses can refer to a single interface, or to an anycast service, or to a multicast data stream, etc.). In the OSI model, however, each device has a single address. This means the protocols in the OSI model are often referred to by the types of devices they are designed to connect. For instance, the protocol carrying reachability and topology (or routing) information through the network is called the Intermediate System to Intermediate System (IS-IS) protocol, because it runs between intermediate systems. There is also a protocol designed to allow intermediate systems to discover end systems; this is called the End System to Intermediate System (ES-IS) protocol (you did not expect creative names, did you?).
Note
It is one of the sad facts of network engineering history that proponents of the TCP/IP protocol suite developed an early dislike of the OSI protocol suite, to the point of rejecting the lessons learned in their development. While this has largely worn down into a rather more mild bit of fun in more recent years, the years lost to rejecting a protocol based on its origins, rather than its technical merits, are a lesson in humility in network engineering. Focus on the ideas, rather than the people; learn from everyone and every project you can; do not allow your ego to get in the way of the larger project, or solving the problem at hand.
The DoD and OSI models have two particular focal points in common:
• They both contain application layers; this makes sense in the context of the earlier world of network engineering, as the application and network software were all part of a larger system.
• They combine the concepts of what data should be contained where with the concept of what goal is accomplished by a particular layer.
This leads to some odd questions, such as
• The Border Gateway Protocol (BGP), which provides routing (reachability) between independent entities (autonomous systems), runs on top of the transport layer in both models. Does this make it an application? At the same time, this protocol is providing reachability information the network layer needs to operate. Does this make it a network layer protocol?
• IPsec adds information to the Internet Protocol (IP) header, and specifies the encryption of information being carried across the network. Because IP is a network layer protocol, and IPsec (sort of) runs on top of IP, does this make IPsec a transport protocol? Or, because IPsec runs parallel to IP, is it a network layer protocol?
Arguing over these kinds of questions can provide a lot of entertainment at a technical conference or standards meeting; however, they also point to some amount of ambiguity in the way these models are defined. The ambiguity comes from the careful mixture of form and function found in these models; do they describe where information is contained, who uses the information, what is done to the information, or a specific goal that needs to be met to resolve a specific problem in transporting information through a network? The answer is—all of the above. Or perhaps, it depends.
This leads to the following observation: there are really only four functions any data-carrying protocol can serve: transport, multiplexing, error correction, and flow control. If these sound familiar, they should—because these are the same four functions uncovered in the investigation of human language in Chapter 2, “Data Transport Problems and Solutions.”
There are two natural groupings within these four functions: transport and multiplexing, error and flow control. So most protocols fall into doing one of two things:
• The protocol provides transport, including some form of translation from one data format to another; and multiplexing, the capability of the protocol to keep data from different hosts and applications separate.
• The protocol provides error control, either through the capability to correct small errors or to retransmit lost or corrupted data; and flow control, which prevents undue data loss because of a mismatch between the network’s capability to deliver data and the application’s capability to generate data.
From this perspective, Ethernet provides transport services and flow control, so it is a mixed bag concentrated on a single link, port to port (or tunnel endpoint to tunnel endpoint) within a network. IP is a multihop protocol (a protocol that spans more than one physical link) providing transport services, while TCP is a multihop protocol that uses IP’s transport mechanisms and provides error correction and flow control. Figure 3-4 illustrates the iterative model.
Each layer of the model has one of the same two functions, just at a different scope. This model has not caught on widely in network protocol work, but it provides a much simpler view of network protocol dynamics and operations than either the seven- or four-layer models, and it adds in the concept of scope, which is of vital importance in considering network operation. The scope of information is the foundation of network stability and resilience.
The iterative model also brings the concepts of connection-oriented and connectionless network protocols out into the light of day again.
Connection-oriented protocols set up an end-to-end connection, including all the state to transfer meaningful data, before sending the first bit of data. The state could include such things as the Quality of Service requirements, the path the traffic will take through the network, the specific applications that will send and receive the data, the rate at which data can be sent, and other information. Once the connection is set up, data can be transferred with very little overhead.
Connectionless services, on the other hand, combine the data required to transmit data with the data itself, carrying both in a single packet (or protocol data unit). Connectionless protocols simply spread the state required to carry data through the network to every possible device that might need the data, while connection-oriented models constrain state to only the devices that need to know about a specific flow of packets. The result is that single device or link failures in a connectionless network can be healed by moving the traffic onto another possible path, rather than redoing all the work needed to build the state to continue carrying traffic from source to destination.
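The difference in where state lives can be made concrete with a small sketch. Everything here is invented for illustration (the device names, the flow identifier, the dictionary shapes); the point is that the connection-oriented flow only works along the path where state was installed, while the connectionless packet carries what it needs and can take any path.

```python
# Connection-oriented: per-flow state installed on every device along a
# chosen path before any data moves.
connection_state = {}  # device -> {flow_id: qos}

def setup_connection(path, flow_id, qos):
    for device in path:
        connection_state.setdefault(device, {})[flow_id] = qos

def send_on_connection(path, flow_id):
    # Every device on the path must already hold state for this flow.
    return all(flow_id in connection_state.get(d, {}) for d in path)

# Connectionless: each packet is self-describing; no per-flow setup.
def make_packet(src, dst, payload):
    return {"src": src, "dst": dst, "payload": payload}

setup_connection(["r1", "r2", "r3"], flow_id=7, qos="gold")
print(send_on_connection(["r1", "r2", "r3"], 7))  # True: state is in place
print(send_on_connection(["r1", "r4", "r3"], 7))  # False: the new path lacks state

pkt = make_packet("a", "b", b"data")
print("dst" in pkt)  # True: any router can forward using the header alone
```

If r2 fails, the connectionless packet can simply be forwarded through r4; the connection-oriented flow must first rebuild its state along the new path.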
Most modern networks are built with connectionless transport models combined with connection-oriented Quality of Service, error control, and flow control models. This combination is not always ideal; for instance, Quality of Service is normally configured along specific paths to match specific flows that should be following those paths. This treatment of Quality of Service as more connection oriented than the actual traffic flows being managed causes strong disconnects between the ideal state of a network and various possible failure modes.
Knowing a number of models, and how they apply to various network protocols, can help you quickly understand a protocol you have not encountered before and diagnose problems in an operational network. Knowing the history of the protocol models can help you understand why particular protocols were designed the way they were, particularly the problems the protocol designers thought needed to be solved, and the protocols surrounding the protocol when it was originally designed. Different kinds of models abstract a set of protocols in different ways; knowing several models, and how to fit a set of protocols into each of the models, can help you understand the protocol operation in different ways, rather than a single way, much like seeing a vase in a painting is far different than seeing it in a three-dimensional presentation.
Of particular importance are the two concepts of connectionless and connection-oriented protocols. These two concepts will be foundational in understanding flow control, error management, and many other protocol operations.
The next chapter is going to apply these models to lower layer transport protocols.
Cerf, Vinton G., and Edward Cain. “The DoD Internet Architecture Model.” Computer Networks 7 (1983): 307–18.
Day, J. Patterns in Network Architecture: A Return to Fundamentals. Indianapolis, IN: Pearson Education, 2007.
Grasa, Eduard. “Design Principles of the Recursive InterNetwork Architecture.” In 3rd FIArch Workshop. Brussels, 2011. http://www.future-internet.eu/fileadmin/documents/fiarch23may2011/06-Grasa_DesignPrinciplesOTheRecursiveInterNetworkArchitecture.pdf.
Maathuis, I., and W. A. Smit. "The Battle between Standards: TCP/IP vs. OSI: Victory through Path Dependency or by Quality?" In The 3rd Conference on Standardization and Innovation in Information Technology, 2003, 161–76. doi:10.1109/SIIT.2003.1251205.
Padlipsky, Michael A. The Elements of Networking Style and Other Essays and Animadversions on the Art of Intercomputer Networking. Prentice-Hall, 1985.
Russell, Andrew L. “OSI: The Internet That Wasn’t.” Professional Organization. IEEE Spectrum, September 27, 2016. https://spectrum.ieee.org/tech-history/cyberspace/osi-the-internet-that-wasnt.
White, Russ, and Denise Donohue. The Art of Network Architecture: Business-Driven Design. 1st edition. Indianapolis, IN: Cisco Press, 2014.
1. Research the protocols in the X.25 stack, which predates the three network models described in this chapter. Does the X.25 protocol stack show a layered design? Which layers of the DoD and OSI models does each protocol in the X.25 stack fit into? Can you describe each protocol in terms of the RINA model?
2. Research the protocols in the IBM Systems Network Architecture (SNA) stack, which predates the three network models described in this chapter. Does the SNA protocol stack show a layered design? Which layers of the DoD and OSI models does each protocol in the SNA stack fit into? Can you describe each protocol in terms of the RINA model?
3. Billing is considered in some protocol stacks and models (such as the X.25 stack), and not in others. Why do you think this might be the case? Consider the way in which network utilization is used in the IP and X.25 stacks, specifically the use of bandwidth versus packets as a primary measurement system.
4. How does a layered network model contribute to the modularity of network protocol stacks?
5. How does a layered network model improve an engineer’s understanding of how a network works?
6. Draw a diagram comparing the DoD and OSI models. Does each layer from one model fit neatly into the other?
7. Consider the OSI and RINA models; can you figure out which services from the RINA model fit into which layers in the OSI model?
8. Consider the connectionless versus connection-oriented models of protocol operation in light of the State/Optimization/Surface model, specifically in terms of state and optimization. Can you explain where adding state in a connection-oriented model increases optimal use of network resources? How does it decrease the optimal use of network resources?
9. In older network models, applications were often considered part of the protocol stack. Over time, however, applications seem to have been largely separated out of the network protocol stack, and considered as a “user” or “consumer” of network services. Can you think of a particular shift in the design of end hosts in relationship to the applications running on end hosts that would cause this shift in thinking in network engineering?
10. Do you think fixed length packets (or frames, or cells) make more sense from a protocol design perspective than variable length packets? How much state does a variable length packet format add compared to a fixed length format? How much optimization is gained? A useful point of departure for answering this question would be a list or chart of the average packet lengths carried through the global Internet.
1. Cerf and Cain, “The DoD Internet Architecture Model.”
Data transport protocols are often layered, with lower layers providing services along a single hop, a middle set of layers providing services end to end between two devices, and, potentially, a set of layers providing services end to end between two applications, or two instances of a single application. Figure 4-1 illustrates.
Each set of protocols is shown as a pair of protocols, because—as shown in the Recursive Internet Architecture (RINA) model in the previous chapter—transport protocols normally come in pairs, with each protocol in the pair taking on specific functions. This chapter will consider the physical and datalink protocols, as shown in Figure 4-1. Specifically, this chapter will consider two widely used protocols for point-to-point transport in networks: Ethernet and WiFi (802.11).
Many of the early mechanisms designed to allow multiple computers to share a single wire were based on designs adopted from more telephone-oriented technologies. They generally focused on token passing and other more deterministic schemes for ensuring two devices did not try to use the single shared electrical medium at the same time. Ethernet, invented in the early 1970s by Bob Metcalf (who was working at Xerox at the time), resolved overlapping talkers in a different way—through a very simple set of rules to prevent the majority of overlapping transmissions, and then resolving any overlapping transmissions through detection and backoff.
The initial focus of any protocol that interacts with a physical medium is going to be in the area of multiplexing, as few other problems can be addressed until this first problem is solved. Therefore, this section will begin with a description of the multiplexing components of Ethernet and then move to other operational aspects.
To understand the multiplexing problem Ethernet faced when it was first invented, consider the following problem: In a shared medium network, the entire shared medium is a single electrical circuit (or wire).
When one host transmits a packet, every other host on the network receives the signal. This is much like a conversation held in an open air environment; a sound transmitted over the common medium (the air) is heard by every listener. There is no physical way to restrict the set of listeners during the transmission process.
The resulting system, called Carrier Sense Multiple Access with Collision Detection (CSMA/CD), operates using a set of steps:
1. The host listens on the medium to see if there are any existing transmissions in progress; this is the carrier sense part of the process.
2. On hearing there is no other transmission in progress, the host will begin serializing the bits in the frame onto the wire.
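The two steps above can be sketched directly. The Medium and Host classes below are invented stand-ins for the shared wire and an attached station; the sketch models only the carrier-sense decision, not the electrical signaling.

```python
class Medium:
    """An invented stand-in for the single shared wire."""
    def __init__(self):
        self.busy = False  # True while some host's carrier is on the wire


class Host:
    def __init__(self, name, medium):
        self.name = name
        self.medium = medium

    def try_send(self, frame):
        # Step 1: carrier sense -- listen before transmitting.
        if self.medium.busy:
            return False  # defer; another transmission is in progress
        # Step 2: no carrier heard, so serialize the frame onto the wire.
        self.medium.busy = True
        # ... bits would be serialized here ...
        self.medium.busy = False
        return True


wire = Medium()
a = Host("A", wire)
print(a.try_send(b"frame"))  # True: the wire was idle, so A transmitted

wire.busy = True             # another host is mid-transmission
print(a.try_send(b"frame"))  # False: carrier sensed, A defers
```

Carrier sense alone cannot prevent every collision, because the medium can look idle to a host that the other host's signal has not yet reached, which is exactly the situation the next section describes.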
This part is simple—just listen before transmitting. It is possible, of course, for the transmissions of two (or more) hosts to collide as Figure 4-2 illustrates.
In Figure 4-2:
1. At time 1 (T1), A begins transmitting a frame onto the shared medium. It takes some amount of time for the signal to travel from one end of the wire to the other; this is called the propagation delay.
2. At time 2 (T2), C listens for a signal on the wire, and, detecting none, begins transmitting a frame onto the shared medium. A collision has already occurred at this point, as both A and C are transmitting a frame at the same moment, but neither of them has yet detected the collision.
3. At time 3 (T3), the two signals actually collide on the wire, causing them both to be malformed, and hence unreadable.
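The collision window in this timeline comes entirely from the propagation delay. The numbers below are invented for illustration; the sketch only shows the arithmetic of why C's carrier sense fails.

```python
# Toy timeline of the collision in Figure 4-2, using made-up time units.
PROP_DELAY = 5  # time for a signal to cross the full length of the wire

a_start = 1     # T1: A begins transmitting
c_start = 3     # T2: C listens and hears silence, because A's signal
                # will not arrive at C until a_start + PROP_DELAY

# C can only sense A's carrier after the signal has reached it; since C
# listened before that moment, both hosts transmit and a collision occurs.
collision = c_start < a_start + PROP_DELAY
print(collision)  # True: the two frames overlap on the shared medium
```

This is why the length of the wire (and hence the worst-case propagation delay) matters to the design of the jam signal discussed below.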
A collision can be detected at A at the moment the signal from C reaches A by having A listen to its own signal as it is transmitted onto the wire. When the signal from C reaches A, A will receive the malformed signal caused by the combination of the two signals (the result of the collision). This is the collision detection portion (the CD portion) of CSMA/CD operation.
What should a host do when it detects a collision? In the original Ethernet design, the host will send a jam signal long enough to force any other host connected to the wire to sense the collision and stop transmitting. The length of the jam signal was originally set so the jam signal would consume at least the amount of time required to transmit a maximum-sized frame on the wire across the entire length of the wire. Why this specific amount of time?
• If a shorter than maximum frame was used in determining the amount of time the jam signal is transmitted, then a host with older interfaces (which cannot send and receive at the same time) may actually miss the entire jam signal while transmitting a single large frame, making the jam signal ineffective.
• It is important to allow enough time for the hosts connected at the very end of the wires to receive the jam signal, so they will sense the collision and take the following steps.
Once the jam signal is received, each host connected to the wire will set a back-off timer so they will each wait some random amount of time before attempting to transmit again. Because these timers are set to a random number, when the two hosts with frames waiting to be transmitted attempt their next transmission, the collision should not occur again.
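The randomized wait can be sketched as follows. Real Ethernet uses truncated binary exponential backoff, in which the range of possible waits doubles with each successive collision up to a cap; the function below sketches that idea, with the cap value chosen arbitrarily for the example.

```python
import random

def backoff_slots(attempt, max_exponent=10):
    """Pick a random number of slot times to wait after the Nth collision.

    The range doubles with each collision (0..2^n - 1), capped at
    max_exponent, so repeated collisions spread retries further apart.
    """
    exponent = min(attempt, max_exponent)
    return random.randrange(0, 2 ** exponent)

random.seed(1)  # fixed seed so the sketch is repeatable
waits = [backoff_slots(attempt=3) for _ in range(4)]
print(all(0 <= w < 8 for w in waits))  # True: after 3 collisions, 0..7 slots
```

Because the two colliding hosts almost always draw different values, their retransmissions start at different times and the second attempt usually succeeds.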
If every host connected to the single wire receives the same signal at roughly the same time (given propagation delay through the wire), how does any particular host know whether it should actually receive a particular frame (or rather, copy the information within a frame from the wire to local memory)? This is the job of Media Access Control (MAC) addresses.
Each physical interface is assigned (at least) one MAC address. Each Ethernet frame contains a source and destination MAC address; the frame is formatted so the destination MAC address is received before any data. Once the entire destination MAC address has been received, a host can decide whether it should continue receiving the packet or not. If the destination address matches the interface address, the host continues copying information off the wire and into memory. If the destination address does not match the local interface address, the host simply stops receiving the packet.
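The receive decision can be shown in a few lines. The addresses below are arbitrary examples; the sketch relies only on the fact, stated above, that the destination MAC address arrives at the front of the frame.

```python
def should_receive(frame_bytes, my_mac):
    """The first 6 bytes of an Ethernet frame are the destination MAC."""
    dst = frame_bytes[:6]
    return dst == my_mac

my_mac = bytes.fromhex("001122334455")
src_mac = bytes.fromhex("66778899aabb")

frame_for_me = my_mac + src_mac + b"payload"
frame_for_other = bytes.fromhex("ffeeddccbbaa") + src_mac + b"payload"

print(should_receive(frame_for_me, my_mac))     # True: keep copying to memory
print(should_receive(frame_for_other, my_mac))  # False: stop receiving
```

A real interface makes this decision in hardware as the bytes arrive, which is why placing the destination address first in the frame matters.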
What about duplicate MAC addresses? If multiple hosts connected to the same medium have the same physical address, they would each receive, and potentially process, the same frames. There are ways to detect duplicate MAC addresses, but these are implemented as part of interlayer discovery rather than Ethernet itself; these will be considered in Chapter 6, “Interlayer Discovery.” Ethernet itself assumes either
• MAC addresses will be properly assigned by the system administrator, if they are manually assigned.
• MAC addresses will be assigned by the device manufacturer so duplicate MAC addresses never occur, no matter how many hosts are connected to one another.
Because MAC addresses are normally rewritten at every router (see Chapter 7 for more information), they only need to be unique within the segment or broadcast domain. While many older systems strove to ensure per segment or broadcast domain uniqueness, this must normally be enforced through manual configuration, and hence has largely been abandoned in favor of attempting to provide each device with a globally unique MAC address “baked into” the Ethernet chipset when it is created.
The first solution is difficult to implement in most large-scale networks; manual configuration of MAC addresses is extremely rare in the real world to the point of nonexistence. The second option essentially means MAC addresses must be assigned to individual devices so no two devices in the world share the same MAC address. How is this possible? By assigning MAC addresses out of a central repository managed through a standards organization. Figure 4-3 illustrates.
The MAC address is broken up into two sections: an Organizationally Unique Identifier (OUI) and a network interface identifier. The network interface identifier is assigned by the manufacturer of the Ethernet chipset. Companies producing Ethernet chipsets, in turn, are assigned the organizational identifiers by the Institute of Electrical and Electronic Engineers (the IEEE). So long as an organization (or manufacturer) always assigns addresses to a chipset with its OUI in the first three octets of the MAC address, and does not assign any two devices the same network interface identifier in the last three octets of the MAC address, no two MAC addresses should be the same for any Ethernet chipset.
Two bits within the OUI space are set aside to signal whether the MAC address has been locally assigned (which means the manufacturer’s assigned MAC address has been overridden by the device’s configuration), and whether the MAC address is intended as one of the following:
• Unicast address, which means it describes a single interface
• Multicast address, which means it describes a group of receivers
The MAC address consists of 48 bits; with these two bits removed, the MAC address space is 46 bits, which means it can describe 2^46—or 70,368,744,177,664—addressable interfaces. Because this is (potentially) not enough to account for the rapidly growing number of new addressable devices, such as Bluetooth headsets and sensors, the length of a MAC address was increased to 64 bits to create the EUI-64 MAC address, which is constructed in the same way as the shorter 48-bit MAC address. These addresses can support 2^62—or 4,611,686,018,427,387,904—addressable interfaces.
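The address structure described above can be pulled apart programmatically. In the first octet of a MAC address, the least significant bit is the individual/group (unicast/multicast) bit and the next bit is the universal/local bit; the example address below is arbitrary, chosen only because it has the locally administered bit set.

```python
def parse_mac(mac_str):
    """Split a 48-bit MAC into OUI, interface identifier, and flag bits."""
    octets = bytes.fromhex(mac_str.replace(":", ""))
    return {
        "oui": octets[:3],                               # assigned by the IEEE
        "nic": octets[3:],                               # assigned by the maker
        "multicast": bool(octets[0] & 0x01),             # individual/group bit
        "locally_administered": bool(octets[0] & 0x02),  # universal/local bit
    }

info = parse_mac("02:11:22:33:44:55")
print(info["multicast"])             # False: this is a unicast address
print(info["locally_administered"])  # True: the 0x02 bit is set

# The address space sizes quoted in the text: 48 or 64 bits minus the
# two flag bits.
print(2 ** 46)  # 70368744177664
print(2 ** 62)  # 4611686018427387904
```

The same bit positions apply to EUI-64 addresses; only the interface identifier portion grows.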
The shared medium model of Ethernet deployment has largely (though not completely!) been replaced in most networks. Rather than a shared medium, most Ethernet deployments now are switched, which means the single electrical circuit, or the single wire, is broken up into multiple circuits by connecting each device to a port on a switch. Figure 4-4 illustrates.
In Figure 4-4, each device is connected to a different set of wires, all of which terminate in a single switch. If the network interfaces at the three hosts (A, B, and C) and at the switch can either send or receive at any moment in time, but cannot do both at once, it is possible for A to send while the switch is also sending. In this case, the CSMA/CD process must still be followed to prevent collisions, even on networks where only two transmitters are connected to the same wire. This mode of operation is called half duplex.
If the Ethernet chipsets can both listen and transmit at the same time in order to detect collisions, however, this situation can be changed. The easiest way to manage this is to place the receive and transmit signals on different physical wires within the set of wires used in the Ethernet cable. Using different wires means there is no way for the transmissions from the two connected systems to collide, so the chipset can both transmit and receive at the same time. To enable this mode of operation, called full duplex, twisted pair Ethernet carries the signal in one direction on one pair of wires, and the signal in the opposite direction on another set of wires. In this case, CSMA/CD is no longer needed.
The switch must learn which device (host) is connected to each port for this system to work; learning about the reachable destinations in a switched network is considered in Chapter 15, “Distance Vector Control Planes.”
CSMA/CD is designed to prevent one kind of detectable error in Ethernet: when collisions cause a frame to be malformed. Other kinds of errors can slip into a signal, however, as with any other electrical or optical system. For instance, in a twisted pair cabling system, if the twisted wires are “unwound” too much in installing a connector, one wire can transfer its signal to another wire through magnetic interference, causing cross talk. As a signal travels down a wire, it can reach the other end of the wire and reflect back along the length of the wire, as well.
How does Ethernet control for these errors? The original Ethernet standard included a 32-bit Cyclic Redundancy Check (CRC) in each frame, which can detect a large array of errors in transmission, as noted in Chapter 2, “Data Transport Problems and Solutions.” At higher speeds, and on optical (rather than electrical) transport mechanisms, however, the CRC can miss enough errors to impact the operation of the protocol. To provide better error control, later (and faster) Ethernet standards have included more robust error control mechanisms.
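As a sketch of how the frame check works, Python's zlib implements the same CRC-32 polynomial Ethernet uses; the payload here is invented:

```python
import zlib

frame_payload = b"example Ethernet payload"
fcs = zlib.crc32(frame_payload)  # the 32-bit CRC carried in the frame's FCS

# A single flipped bit in transit changes the CRC, so the receiver can
# detect the corruption by recomputing the check over what it received.
corrupted = bytearray(frame_payload)
corrupted[3] ^= 0x01  # flip one bit
assert zlib.crc32(bytes(corrupted)) != fcs
```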
For instance, Gigabit Ethernet specifies an 8B10B encoding scheme designed to ensure the correct synchronization of sender and receiver clocks; this scheme also detects some bit errors, as well. Ten-gigabit Ethernet is often implemented in hardware with a Reed-Solomon code Error Correction (EC) system and a 16B18B encoding system, which provides good Forward Error Correction (FEC) and clock synchronization with 18% overhead.
Note
The 8B10B encoding scheme attempts to ensure there are approximately the same number of 0 and 1 bits in a data stream, which allows for efficient laser utilization and provides for clock synchronization to be embedded in the signal. The scheme works by encoding 8 bits of data (8B) into 10 transmitted bits on the wire (10B), which means there is about 25% overhead for each character transmitted. Single bit parity errors can be detected and corrected because the receiver knows how many 0s and 1s should have been received.
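A toy illustration of the balancing idea (this is not the real 8B10B code table, only the running-disparity principle): when a 10-bit codeword carries unequal numbers of 0s and 1s, the encoder may transmit either the codeword or its complement, choosing whichever pulls the running count of 1s-minus-0s back toward zero:

```python
def disparity(word: str) -> int:
    # Positive when the word has more 1s than 0s.
    return word.count("1") - word.count("0")

def encode(words, running=0):
    out = []
    for w in words:
        comp = "".join("1" if b == "0" else "0" for b in w)
        # Pick the variant that moves the running disparity toward zero.
        if abs(running + disparity(w)) <= abs(running + disparity(comp)):
            chosen = w
        else:
            chosen = comp
        running += disparity(chosen)
        out.append(chosen)
    return out, running

# Four copies of an unbalanced codeword (six 1s, four 0s):
codewords = ["1110110100"] * 4
encoded, final_disparity = encode(codewords)
```

After every second codeword the complement is chosen, so the stream stays balanced overall.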
Ethernet transmits data in packets and frames; the packet is made up of the preamble information, the frame, and any trailing information. The frame contains a header, which is made up of fixed length fields, and the data being carried. Figure 4-5 illustrates an Ethernet packet; the frame is marked out as well.
In Figure 4-5, the preamble contains a beginning-of-frame marker, information the receiver can use to synchronize its clock to the incoming packet, and other information. The destination address is received immediately after the preamble, so the receiver can quickly decide whether to copy this packet into memory or not. The addresses, protocol type, and carried data are all part of the frame. Finally, any FEC information and other trailers are added onto the frame to make up the final section(s) of the packet.
The type field is of particular interest, as it allows the next layer up—the protocol whose information is carried in the data field—to be identified. This information is opaque to Ethernet—the Ethernet chipset does not know how to interpret the carried data, only where it is and how to carry it. Without this field there would be no consistent way for the carried data to be dispatched to the correct upper-layer protocol, or rather, for multiple upper-layer protocols to be properly multiplexed into Ethernet frames, and then properly demultiplexed.
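A minimal sketch of this demultiplexing, assuming a handler table keyed by EtherType (the table entries and frame bytes here are illustrative):

```python
import struct

# EtherType values identify the upper-layer protocol carried in the frame.
ETHERTYPE_HANDLERS = {
    0x0800: "IPv4",
    0x0806: "ARP",
    0x86DD: "IPv6",
}

def demux(frame: bytes):
    """Read the fixed-length header and hand the opaque data field to the
    protocol named by the type field."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    protocol = ETHERTYPE_HANDLERS.get(ethertype, "unknown")
    return protocol, frame[14:]  # Ethernet never looks inside the data

# Broadcast destination, an invented source address, EtherType 0x0800:
frame = b"\xff" * 6 + b"\x02\x00\x00\x00\x00\x01" + b"\x08\x00" + b"payload"
proto, data = demux(frame)
```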
In the original CSMA/CD implementation of Ethernet, the shared medium itself provided a sort of basic flow control mechanism. Assuming no two hosts can transmit at the same time, and information transmitted by some upper-layer protocol must be acknowledged or answered at least occasionally, the transmitter must periodically take a break to receive any acknowledgment or reply. There are sometimes situations where this rather rough form of flow control does not work; the Ethernet specification assumes some higher layer protocol will control the flow of information to prevent failures in this case.
In switched full duplex Ethernet, there is no CSMA/CD, as there is no shared medium. The two hosts connected to the pair of transmission channels can send data as quickly as the wires permit. This can, in fact, result in a situation where a host receives more data than it can process. To resolve this, a pause frame was developed for Ethernet. When a receiver sends the pause frame, the sender is supposed to stop sending traffic for a specified period of time.
Pause frames are not widely deployed.
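A pause frame can be sketched as bytes, assuming the 802.3x layout (the reserved MAC-control multicast destination, MAC control EtherType 0x8808, and PAUSE opcode 0x0001); padding and the FCS are omitted, and the source address here is invented:

```python
import struct

def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Sketch of an 802.3x PAUSE frame header and opcode fields.
    quanta counts the requested pause time in units of 512 bit times."""
    dst = bytes.fromhex("0180c2000001")  # reserved MAC-control multicast
    ethertype = 0x8808                   # MAC control
    opcode = 0x0001                      # PAUSE
    return dst + src_mac + struct.pack("!HHH", ethertype, opcode, quanta)

# Ask the sender to stop for the maximum interval:
frame = build_pause_frame(bytes.fromhex("020000000001"), quanta=0xFFFF)
```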
Note
Many protocols do not contain all four of the functions described as part of the Recursive Internet Architecture (RINA) model described in Chapter 3, “Modeling Network Transport”: error control, flow control, transport, and multiplexing. Even among those protocols implementing all four functions, all four are not always deployed. Normally, in this situation, the protocol and/or network designer is handing the function off to a lower or higher layer in the stack. This does work in some cases, but you should always be careful about assuming it is the correct thing to do. For instance, there is a difference between hop-by-hop encryption and end-to-end encryption. End-to-end is good for applications and protocols that do encrypt, but not every application does, in fact, encrypt data being transferred, nor does every host have an encrypted transport configured. In these cases, hop-by-hop encryption can be useful across less than secure links, such as wireless connections.
Commonly called and marketed as WiFi, 802.11 is widely deployed for carrying data over wireless in the unlicensed (in the United States) 2.4 and 5GHz radio spectrums. Microwave ovens, radar systems, Bluetooth, some amateur radio systems, and even baby monitors also use the 2.4GHz radio spectrum, so WiFi can both interfere with and be interfered with by these other systems.
The 802.11 specifications generally use a form of frequency multiplexing to carry a large amount of information across a single channel, or set of frequencies. The frequency of a signal is simply the rate at which the signal switches polarity within a single second; hence a 2.4GHz signal is an electrical signal, carried across either a wire, an optical fiber, or the air, that switches polarity, from positive to negative (or negative to positive), 2.4 × 10^9 times per second.
Note
These are bare minimum descriptions; there is an entire field of radio and wave propagation you can study if you are so inclined. The goal here is to give you enough information to understand the basic concepts without overwhelming you.
To understand the concept of wireless signaling, it is best to begin with the idea of carrier and modulation; Figure 4-6 illustrates.
In Figure 4-6, a single center frequency is chosen; the channel will be a range of frequencies on either side of this center frequency. Within the resulting channel, two carrier frequencies are chosen so they are orthogonal to one another—which means signals carried on these two carrier frequencies will not interfere with one another. These are marked as OSF 1 and OSF 2 in the figure. Each of these carrier frequencies is, in turn, actually a narrower channel, allowing the actual signal of 0s and 1s to be modulated onto the channel. Modulation, in this case, means varying the actual frequency of the signal around each OSF frequency.
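Orthogonality can be checked numerically: the inner product of two subcarriers with different whole-cycle frequencies over one symbol period is (nearly) zero, while a subcarrier correlates strongly with itself. The sample count and cycle counts here are invented for illustration, not actual 802.11 parameters:

```python
import math

N = 1000  # samples across one symbol period

def subcarrier(k):
    # A sine completing k full cycles across the symbol period.
    return [math.sin(2 * math.pi * k * n / N) for n in range(N)]

def inner(a, b):
    # Normalized inner product over the symbol period.
    return sum(x * y for x, y in zip(a, b)) / N

same = inner(subcarrier(3), subcarrier(3))       # correlates with itself
different = inner(subcarrier(3), subcarrier(4))  # orthogonal: near zero
```

Because the cross-correlation is zero, a receiver can recover the signal modulated onto one subcarrier without interference from the other.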
Note
Modulation simply means modifying the carrier in some way that allows a signal to be carried so a receiver can reliably decode it.
Thus, the 802.11 specification uses an Orthogonal Frequency Division Multiplexing (OFDM) scheme, and encodes the actual data using Frequency Modulation (FM).
Note
One of the confusing points about multiplexing is it has two meanings, rather than one. Either it means to place multiple bits on the same medium at once, or it means allowing multiple hosts to communicate using the same medium at once. Which of these two meanings is intended can only be understood in a specific context. In this section, the meaning is the first, breaking a single medium up into channels to allow multiple bits to be transmitted at once. In most of the rest of this book, it means the second, allowing multiple hosts to transfer data over the same medium.
The speed at which data can be transmitted on such a system (the bandwidth) depends directly on the width of each channel and the ability of the transmitter to select orthogonal frequencies. To increase the speed of 802.11, then, two different techniques have been applied. The first is simply to increase the channel width, so more carrier frequencies can be used to carry data. The second is to find more efficient ways to pack data into a single channel by using more complex modulation methods. For instance, 802.11b can use a 40MHz wide channel in the 2.4GHz range, while 802.11ac can use either an 80 or 160MHz wide channel in the 5GHz range.
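One way to see why both techniques work is the Shannon capacity formula, C = B log2(1 + SNR), which is not discussed in the text but is the standard bound relating channel width and signal quality to achievable rate; the SNR values below are arbitrary:

```python
import math

def capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    # Shannon capacity: the theoretical ceiling for an error-free rate.
    return bandwidth_hz * math.log2(1 + snr_linear)

# Doubling the channel width doubles the ceiling at the same SNR...
narrow = capacity_bps(40e6, 100)   # a 40MHz channel
wide = capacity_bps(80e6, 100)     # an 80MHz channel

# ...while more complex modulation, usable only at higher SNR, also raises it.
better_snr = capacity_bps(40e6, 1000)
```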
Other forms of multiplexing to gain more bandwidth between two devices are also used in the 802.11 specification series. The 802.11n specification introduced Multiple Input Multiple Output (MIMO) antenna arrays, which allow the signal to follow different paths through the single medium (air). This might seem impossible, as there is only one “air” in a room, but wireless signals actually bounce off different objects within a room, which causes them to take multiple paths through the space. Figure 4-7 illustrates.
In Figure 4-7, assuming the transmitter is using an antenna that will transmit in all directions (an omnidirectional antenna), there are three paths through the single space, labeled 1, 2, and 3. The transmitter and receiver cannot “see” the three separate paths, but they can measure the strength of signal between each pair of antennas, and try sending different signals between apparently separated pairs until they find multiple paths over which different sets of data can be sent.
A second way multiple antennas can be used is in beamforming. Normally, a wireless signal transmitted from an antenna covers a circle (a ball in three dimensions, but this is difficult to meaningfully illustrate). In beamforming, the beam is shaped using one of various techniques to make it more oblong. Figure 4-8 illustrates these concepts.
In the unformed pattern, the signal is roughly a ball or globe around the tip of the antenna; drawn from the top, it looks much like a simple circle extending to the farthest point in the ball shape. By using a reflector, the beam can be shaped, or formed, into a more oblong shape. The space behind the reflector, and to the sides of the beam, will receive less (or even none, for very tight beams) of the transmission power. How can such a reflector be built? The simplest way is with a physical barrier tuned to repel the signal’s power, much like a mirror reflects light, or a wall reflects sound. The key is the point in the transmitted signal at which the physical barrier is placed. Figure 4-9 will be used to explain the key points in the waveform, reflection, and cancellation.
A typical waveform follows a sine wave, which begins at zero power, increases to its maximum positive power, moves back to zero power, and then passes through a matching negative power cycle. Each of these is a cycle; the frequency refers to the number of times this cycle repeats per second. The entire length of the wave in space, along a wire, or an optical fiber, is called the wavelength. The wavelength is inversely proportional to the frequency; the higher the frequency, the shorter the wavelength.
The key point to note in this diagram is the state of the signal at the quarter and half wavelength points. At the quarter wave point, the signal is at its highest power; if an object, or another signal, interferes at this point, the signal will either be absorbed or reflected. At the half wave point, the signal is at the minimum power; if there is no offset, or constant voltage on the signal, the signal will reach zero power. To reflect a signal, then, you can position a physical object so it reflects the power just at the quarter wave point. The physical distance required to do this will, of course, depend on the frequency, just as the wavelength depends on the frequency.
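The quarter-wave distance falls directly out of the inverse relationship between wavelength and frequency (wavelength = c / frequency); a quick computation for the 2.4GHz signals discussed in this chapter:

```python
C = 299_792_458  # speed of light in a vacuum, meters per second

def wavelength_m(freq_hz: float) -> float:
    # Wavelength is inversely proportional to frequency.
    return C / freq_hz

lam = wavelength_m(2.4e9)  # roughly 12.5 cm for a 2.4GHz signal
quarter_wave = lam / 4     # roughly 3.1 cm: where a reflector would sit
```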
Physical reflectors are easy; what if you want to be able to dynamically form the beam without using a physical reflector? Figure 4-10 illustrates the principles you can use here.
The light gray dotted lines in Figure 4-10 provide a phase marker; two signals are in phase if their peaks are aligned, as shown on the left. The two signals shown in the middle are a quarter out of phase, as the peak of one signal is aligned with the zero point, or minimum, of the second signal. The third pair of signals, shown on the far right, are complementary, or 180 degrees out of phase, as the positive peak of one signal aligns with the negative peak of the second signal. The first pair of signals will add together; the third pair of signals will cancel out. The second pair may, if correctly crafted, reflect off one another. These three effects allow a beam to be formed, as shown in Figure 4-11.
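The add and cancel cases can be checked numerically by summing two sampled sine waves at the phase offsets described above; the sample count is arbitrary:

```python
import math

N = 1000  # samples across one cycle

def wave(phase_deg):
    shift = math.radians(phase_deg)
    return [math.sin(2 * math.pi * n / N + shift) for n in range(N)]

def peak(samples):
    return max(abs(s) for s in samples)

# In phase: the peaks align, and the signals add.
in_phase = [a + b for a, b in zip(wave(0), wave(0))]

# 180 degrees out of phase: positive peaks align with negative peaks,
# and the signals cancel.
out_of_phase = [a + b for a, b in zip(wave(0), wave(180))]
```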
A single beamforming system may, or may not, use all of these components, but the general idea is to restrict the beam within a physical space within the medium—generally free air propagation. Beamforming allows the shared physical medium to be used as several different communication channels, as shown in Figure 4-12.
In Figure 4-12, the wireless router has used its beamforming capabilities to form three different beams, each directed at a different host. The router can now send traffic on all three of these formed beams at a higher rate than if it treated the entire space as a single shared medium, because the signals to A will not interfere or overlap with the information transmitted to B or C.
Directional methods such as beamforming only improve the traffic transmitted in one direction. For instance, if a wireless access point is capable of beamforming, and the host it is communicating with is not, the distance across which the two devices are able to communicate will be constrained by the host, as it cannot send a directional signal. However, physical distance is not always the important point in beamforming technologies. The amount of information a wireless signal can carry is related to power, among other factors; the more power received at the receiver (not transmitted, received), the more information that can be transmitted. So if the access point can form a beam such that it delivers twice as much power to the host as the host can deliver back to the access point, it will increase the speed at which the host can download data across the wireless link. It may, then, be worthwhile to have beamforming on one end of the wireless connection (and not the other). Whether it actually is worthwhile depends, as always, on the application, traffic pattern, and many other factors.
The multiplexing problem in wireless signals involves sharing a single channel, much like in wired network systems. Two specific problems dominate the solutions designed to share a single wireless medium: the hidden node problem and the transmission/reception power problem (which is also sometimes called receiver swamping). Figure 4-13 illustrates the hidden node problem.
The three circles in Figure 4-13 represent the three overlapping ranges of the wireless transmitters at A, B, and C. If A transmits toward B, C cannot hear the transmission. Even if C listens for a clear channel, it is possible for A and C to transmit at the same time, causing a collision at B.
The hidden node problem is made worse because of the power of transmission versus the power of the received signal, and the reality of air as a medium. A good rule of thumb for radio signal strength in air is that the signal loses half of its power for every wavelength of space it travels. At high frequencies, signals lose their strength very quickly, which means the transmitter must send a signal at a power orders of magnitude larger than its receiver is capable of receiving.
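Under the chapter's rule of thumb (half the power lost per wavelength traveled), the falloff is easy to quantify; the transmit power and distance below are invented for illustration:

```python
import math

def received_power(p_transmit_w: float, wavelengths_traveled: float) -> float:
    # Half the power is lost for each wavelength of space traveled.
    return p_transmit_w * 0.5 ** wavelengths_traveled

# A 2.4GHz wavelength is about 12.5 cm, so 4 m of air is about 32 wavelengths:
p = received_power(0.1, 32)            # 100 mW transmitter
loss_db = 10 * math.log10(0.1 / p)     # about 96 dB of loss
```

The received power is roughly nine orders of magnitude below the transmitted power, which is why the transmitter swamps its own receiver.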
It is very difficult to build a receiver able to “listen to” the local transmit signal at full strength without destroying the receive circuitry while also being able to “hear” the very low power signals required to extend device range. The transmitter, in other words, swamps the receiver with enough power to destroy the receiver in many situations. This makes it impossible, in a wireless network, for a transmitter to listen to the signal as it is being transmitted, and hence makes the collision detection mechanism used in Ethernet (for instance) impossible to implement.
The mechanism used by 802.11 to share a single channel among multiple transmitters must avoid the hidden node and receiver swamping problems. 802.11 WiFi uses Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) to negotiate channel usage. CSMA/CA is similar to CSMA/CD:
1. Before transmitting, the sender listens to determine if another device is transmitting.
2. If another transmission is heard, the sender backs off for some random period of time before attempting again; this random backoff is designed to prevent several devices from hearing the same transmission, and all trying to transmit again at the same time at some point in the future.
3. If no other transmission is heard, the sender transmits the entire frame; it is impossible for the sender to receive the signal it is transmitting, so there is no way to detect a collision at this point.
4. The receiver sends an acknowledgment for the frame on receipt; if the sender does not receive an acknowledgment, it will assume a collision has occurred, back off for a random amount of time, and resend the frame.
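The steps above can be sketched as a toy simulation; the channel model, backoff timing, and acknowledgment handling are invented for illustration:

```python
import random

def csma_ca_send(channel_busy, max_attempts=5, rng=None):
    """channel_busy: callable returning True while another device transmits.
    Returns the number of listen attempts spent before transmitting."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # Step 1: listen before transmitting.
        if channel_busy():
            # Step 2: random backoff, growing with each failed attempt.
            _backoff = rng.uniform(0, 2 ** attempt)
            continue
        # Step 3: transmit the whole frame; the sender cannot hear its
        # own transmission, so no collision can be detected here.
        # Step 4: wait for the receiver's acknowledgment (assumed here).
        return attempt
    raise RuntimeError("channel never became free")

# The channel is busy for the first two listens, then clear:
states = iter([True, True, False])
attempts_used = csma_ca_send(lambda: next(states))
```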
Some WiFi systems can also use a Request to Send/Clear to Send (RTS/CTS) system. In this case:
1. The sender transmits an RTS.
2. When the channel is clear, and no other transmission is scheduled, the receiver sends a CTS.
3. On receiving the CTS, the sender transmits the data.
Which system will produce higher bandwidth depends on the number of senders and receivers using the channel, the length of the frames, and other factors.
Data marshaling in 802.11 is similar to Ethernet; there is a set of fixed length header fields in each packet, followed by the transported data, and finally a four-octet Frame Check Sequence (FCS), which contains a CRC over the contents of the packet. If the receiver can correct an error based on the CRC information, it will do so; otherwise, the receiver simply does not acknowledge receipt of the frame, which will lead to the frame being retransmitted by the sender.
A sequence number is included in each frame, as well, to ensure packets are received and processed in the order in which they were transmitted. Flow control is provided in the RTS/CTS system by the receiver waiting to send a CTS until it has enough clear buffer space to receive a new packet.
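This buffer-driven CTS behavior, together with sequence numbering, can be sketched as follows; the class, its buffer size, and its method names are illustrative, not from the specification:

```python
from collections import deque

class Receiver:
    """Toy receiver: grants CTS only while buffer space is clear, and
    acknowledges only frames arriving in transmission order."""

    def __init__(self, buffer_slots=2):
        self.buffer = deque()
        self.buffer_slots = buffer_slots
        self.expected_seq = 0

    def request_to_send(self) -> bool:
        # Flow control: CTS is withheld until there is clear buffer space.
        return len(self.buffer) < self.buffer_slots

    def receive(self, seq, frame) -> bool:
        if seq != self.expected_seq:
            return False           # out of order: not acknowledged
        self.buffer.append(frame)
        self.expected_seq += 1
        return True                # acknowledged

rx = Receiver()
assert rx.request_to_send()        # buffer clear, so CTS is granted
rx.receive(0, "frame-0")
rx.receive(1, "frame-1")
granted = rx.request_to_send()     # buffer full, so no CTS
```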
Lower layer transmission protocols tend to be dominated by physical concerns, such as how a host knows when to access the channel and how the channel can be used most efficiently. The four elements (error control, flow control, transport, and multiplexing) are still important to consider, however. Multiplexing, for instance, still requires addresses to determine which host a particular frame is being transmitted to. In other words, multiplexing encompasses addressing and other solutions designed to solve problems found only when interacting with physical channels.
Many of the solutions in these lower layer protocols are also assumed to be a “first line of defense,” rather than “the only line of defense.” Error control in the physical layer tends to be simpler than mechanisms implemented in higher layers, which means these mechanisms are faster to check, but may also allow through some number of errors that need to be detected and corrected at some higher layer. Flow control, in these layers, is focused on controlling traffic across a single link, and is often a side effect of channel access, rather than an explicit control mechanism.
Overall, the physical layer is the farthest from the application, and often gains the least amount of attention of the network designer; yet these protocols are still important, and they still follow the same problem and solution patterns that higher level protocols employ.
1. How is optical modulation different from, or similar to, the electrical modulation described in the chapter?
2. On radio waves, you would normally use different channels at different frequencies to carry multiple signals over a single medium (such as the air in wireless networks, and even over wires in some wired networks). What mechanism is used to “channelize” an optical transmission media, and how does it work?
3. The chapter states multiplexing must often be solved before other problems in data transmission can be addressed. Why might this be?
4. 以太网最初设计为通过细同轴电缆和粗同轴电缆(10BASE5 和 10BASE2)运行。是否可以通过同轴电缆启用全双工操作?为什么或者为什么不?
4. Ethernet was originally designed to operate over thin and thick coax cable (10BASE5 and 10BASE2). Is it possible to enable full duplex operation over coax? Why or why not?
5. 本章指出暂停帧并未在以太网中广泛部署。在什么情况下需要暂停帧,为什么它不再被广泛部署?
5. The chapter notes the pause frame is not widely deployed for Ethernet. Under what conditions would a pause frame be needed, and why would it not be widely deployed any longer?
6. 鉴于音频也是一种穿过空气的波,不同相位信号之间的混合、相乘和取消交互也可能以类似的方式应用于音频工程,这是有道理的。找到一个网站来解释音频设计中这些相同问题的影响,并描述音频工程师用来解决这些问题的一些解决方案。
6. Given audio is also a wave passing through air, it would make sense that the mix, multiply, and cancel interactions between signals in a different phase might also apply to audio engineering in a similar way. Find a site that explains the impacts of these same problems in audio design, and describe some solutions audio engineers use to solve these problems.
7. 建立定向信号的方法有很多种。例如,碟形天线如何塑造无线信号?常用的波束成形天线类型是对数周期偶极子 (LPD)。这种天线是如何工作的?这些类型的天线在无线网络中有用吗?
7. There are many ways to build directional signals. For instance, how does a dish-type antenna shape a wireless signal? A commonly used beamforming antenna type is the Log Periodic Dipole (LPD). How does this kind of antenna work? Would these kinds of antennas ever be useful in wireless networking?
8. 查找某些有线和无线链路类型的吞吐量。它们有根本不同吗?你能解释一下为什么吗?
8. Find the goodput for some wired and wireless link types. Are they radically different? Can you explain why?
9. 以太网规范旨在允许全球制造的每台设备具有不同的 MAC 地址。本章提到了旧版本和设备,需要网络操作员手动配置以太网设备的地址。您能否从状态、优化和表面方面描述这两个选项之间的权衡?尝试找出解决问题的每种可能方法的积极和消极方面。
9. The Ethernet specifications are designed to allow every device manufactured, worldwide, to have different MAC addresses. The chapter mentions older versions, and equipment, that required the network operator to configure Ethernet equipment with addresses manually. Can you describe, in terms of state, optimization, and surfaces, the tradeoffs between these two options? Try to find both positive and negative aspects of each possible way of solving the problem.
10. 本章指出以太网芯片组将被分配至少一个地址,这意味着某些芯片组可能被分配多个地址。描述至少两种使用情况,其中单个芯片需要具有多个 MAC 地址。
10. The chapter states an Ethernet chipset will be assigned at least one address, implying some chipsets may be assigned more than one. Describe at least two use cases where a single chip would need to have more than one MAC address.
While the previous chapter considered two examples of point-to-point data transport over physical media, this chapter will consider four examples of end-to-end data transport. Figure 5-1 illustrates this in terms of the Recursive Internet Architecture (RINA).
Not every transport protocol maps precisely to a single functional layer in RINA, of course, but the mapping is close enough to be useful. The primary point to remember is that, for each transport protocol, there are four questions you can ask:
• How does the protocol provide transport, or how does it marshal data?
• How does the protocol provide multiplexing services, or the ability to carry multiple streams of data on a single shared resource?
• How does the protocol provide error control, which should include not only error detection, but also resolving errors—either through retransmission or providing enough information to rebuild the original information?
• How does the protocol provide for flow control?
Each of these questions can have a number of subquestions, such as discovering the Maximum Transmission Unit (MTU), providing for replication of packets for multicast, etc.
This chapter will consider four protocols:
• The Internet Protocol (IP), which provides the bottom half of the second pair of layers. The primary focuses of IP are the addressing scheme used for multiplexing and the ability to provide a single transport across many different physical transport systems.
• The Transmission Control Protocol (TCP), which provides one version of the top half of the second pair of layers. TCP provides error and flow control, as well as a place to carry multiplexing information for applications and other protocols that run on top of TCP.
• Quick User Datagram Protocol Internet Connections (QUIC), which provides another version of the top half of the second pair of layers. QUIC is much like TCP in its function, but has some significant differences from TCP in the way it operates.
• The Internet Control Message Protocol (ICMP).
The Internet Protocol (IP) was originally documented in a series of Internet Protocol Specification documents called IENs in the middle of the 1970s, mostly written by Jonathan B. Postel. These documents described a protocol called TCP, which, when it was originally deployed, included the functionality contained in two protocols, IP and TCP. Postel noted this combination of functionality in a single protocol was not a good thing; in IEN #2, he states:
We are screwing up in our design of internet protocols by violating the principle of layering. Specifically we are trying to use TCP to do two things: serve as a host level end to end protocol, and serve as an internet packaging and routing protocol. These two things should be provided in a layered and modular way. I suggest that a new distinct internetwork protocol is needed, and that TCP be used strictly as a host level end to end protocol. I also believe that if TCP is used only in this cleaner way it can be simplified somewhat. A third item must be specified as well—the interface between the internet host to host protocol and the internet hop by hop protocol.1
IEN #28, published in February of 1978, specified version 2 of this new Internet Protocol.2 This was quickly replaced by IEN #48 in June 1978,3 and again by IEN #54 in September of 1978.4 In January 1980, IP became an IETF protocol with the publication of RFC 760, which was also known as IEN #128,5 and was updated with the current specification, RFC 791, in September of 1981.6 At this point, the format of the IP version 4 (IPv4) header still in use today was in place.
Note
IPv4 is not covered in depth in this book; while it is widely deployed, version 6 of the IP protocol will be considered instead, as this is the protocol engineers will likely encounter more often in the future. In this spirit, all the examples in this book will use addresses in the version 6 format, as well. The “Further Reading” section lists resources of interest to readers who wish to learn more about IPv4.
The IPv4 address space is a 32-bit unsigned integer, which means it can number, or address, 2^32 devices—about 4.2 billion devices. This sounds like a lot, but the reality is far different for several reasons:
• Each address represents one interface, rather than one device. In fact, IP addresses are often used to represent a service, or a virtual host (or machine), which means a single device will often consume more than one IP address.
• Large numbers of addresses are wasted in the process of aggregation.
In the early 1990s, it became obvious the Internet was going to run out of addresses in the IPv4 address space; charts like the one shown in Figure 5-2 show the available IPv4 address space over time starting in the mid-1990s.7
The easy solution to this situation would have been to extend the IPv4 address space to encompass some larger number of devices, but experience with the IPv4 protocol in the field led the Internet Engineering Task Force (IETF) to take on a larger task: redesigning IPv4. The work on the replacement began in 1990, with the first drafts achieving standard status in 1998. The IPv6 address space contains 2^128 addresses, or around 3.4 × 10^38.
IPv6 is designed to provide services for several different protocols, such as TCP and QUIC, which are discussed in later sections in this chapter. As such, IPv6 provides only two of the four services required to carry data through a network: transport, which includes marshaling data, and multiplexing. These two functions are discussed in greater detail in the following sections.
IP provides a “base layer” on which a wide array of higher layer protocols run, on many different kinds of physical links. To do so, IP must solve two problems:
• Run on many different physical and lower layer protocols while presenting a consistent set of services to higher layers
• Adapt to the wide variety of frame sizes provided by lower layers
To create a single protocol on which all upper layer protocols can run, IP must “fit into” the frame type of many different kinds of physical layer protocols. A series of drafts describe how to run IP on top of a particular physical layer, including MPEG-2 networks,8 Asynchronous Transfer Mode,9 optical networks,10 the Point-to-Point Protocol (PPP),11 the Vertical Blanking Interval (VBI) in television,12 the Fiber Distributed Data Interface (FDDI),13 avian carriers,14 and a number of other physical layer protocols (see the “Further Reading” section below). These drafts largely work out how to carry an IP datagram (or packet) in the frame (or packet) of the underlying physical layer, and how to enable interlayer discovery, such as the Address Resolution Protocol (ARP), to work on each media type (see Chapter 6, “Interlayer Discovery,” for more information).
IP must also specify how to carry large blocks of data across the various MTUs available on different kinds of physical links. While the original Ethernet specification chose an MTU of 1,500 octets to balance between large packet sizes and maximum channel utilization, many other physical layers have been designed with larger MTUs. Further, applications do not tend to send information in neat, MTU-sized chunks. IP manages these two problems by providing for fragmentation; Figure 5-3 illustrates.
If an application (or higher-level protocol) passes 2,000 octets of data to IP for transmission, the IP implementation will
• Determine the MTU along the path through which the data must be transmitted; this is normally a matter of reading a configured or default value set by the system software
• Break the information up into multiple fragments, based on the MTU minus the projected size of the headers, including tunnel headers, etc.—the metadata that must be transmitted along with the data
• Send each fragment with the IPv6 optional fragment header (which means the fragment header does not need to be included with packets that are not fragments of a larger data block)
• Set the offset in the fragment header to the number of the first octet in the original data block this packet represents, divided by 8; in the example in Figure 5-3, the first packet has an offset of 0, while the second has an offset of 150 (1200/8).
• Set the more fragments bit to 0 if this is the last fragment of the data block, and to 1 if there are more fragments to follow.
The size of the total data block IPv6 can carry through fragments is limited by the size of the offset field, which is 13 bits long. Hence, IPv6 can carry, at most, 2^16 octets of data as a series of fragments, or a data block of about 65,536 octets plus one MTU-sized fragment. Anything larger than this would need to be broken up, in some way, by a higher layer protocol before being passed to IPv6 for transport.
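The offset arithmetic described in the steps above can be sketched in a few lines of Python. The fragment() helper here is purely illustrative (it is not any real stack's API), and it assumes a fixed 40-octet IPv6 header with no extension headers:

```python
# A minimal sketch of IPv6-style fragmentation. Offsets are carried in
# units of 8 octets, so every fragment except the last must carry a
# payload that is a multiple of 8 octets.

def fragment(data_len, mtu, header_len=40):
    """Split data_len octets into (offset, length, more_fragments) tuples."""
    payload = (mtu - header_len) // 8 * 8  # largest 8-octet-aligned payload
    fragments = []
    sent = 0
    while sent < data_len:
        size = min(payload, data_len - sent)
        more = 1 if sent + size < data_len else 0
        fragments.append((sent // 8, size, more))  # offset in 8-octet units
        sent += size
    return fragments

# 2,000 octets over a 1,240-octet MTU: the second fragment starts at
# octet 1200, giving the offset of 150 (1200/8) mentioned above.
for offset, size, more in fragment(2000, 1240):
    print(offset, size, more)
```

Running this prints `0 1200 1` and then `150 800 0`, matching the two-fragment example in Figure 5-3.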
Finally, IP must provide some way to carry packets across a network that uses more than one type of physical layer. This is solved by rewriting the lower layer headers at each hop in the network where multiple media types might be interconnected. Devices that rewrite the lower layer headers in this way were originally called gateways, but are now generally called routers, because they route traffic based on the information contained in the IP header. Packet switching is considered in more detail in Chapter 7, “Packet Switching.”
There are some other interesting aspects of the way IPv6 carries data; Figure 5-4 illustrates an IPv6 header to work from.
In Figure 5-4:
• The version is set to 6, for IPv6.
• The traffic class is divided into two fields: 6 bits for carrying the type of service (or service class), and 2 bits for carrying congestion notification. Quality of Service (QoS) is considered in more detail in Chapter 8, “Quality of Service.”
• The flow label is designed as a hint to tell forwarding devices how to keep packets within a single flow on the same path in an equal cost multipath (ECMP) set of paths.
• The payload length indicates the amount of data being carried in the packet, in octets.
• The next header provides information about any additional headers contained in the packet. The IPv6 header can contain information beyond what is contained in the basic header; these optional headers are discussed in more detail in a following section.
• The hop limit is the number of times this packet can be “handled” by a network device before being dropped. Any router (or other device) that rewrites the lower layer headers should decrement this number by one in the forwarding process; when the hop limit reaches 0 or 1, the packet should be discarded.
Note
The hop count is used to prevent a packet from looping in a network forever. Each time the packet is forwarded by a network device, the hop count is decremented by one. If the hop count reaches 0, the packet is discarded. If a packet is looping within the network, the hop count (also called a Time to Live, or TTL) will eventually be reduced to 0, and the packet will be dropped.
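The decrement-and-drop behavior described in the note can be sketched as follows; a plain dict stands in for a parsed packet header, so this is an illustration of the rule, not a real forwarding path:

```python
# A minimal sketch of the hop limit check performed at each forwarding hop.

def forward(packet):
    """Decrement the hop limit; return None if the packet must be dropped."""
    if packet["hop_limit"] <= 1:
        # A real router would also generate an ICMP "time exceeded" message.
        return None
    packet["hop_limit"] -= 1
    return packet

# A looping packet survives only as many hops as its hop limit allows.
looping_packet = {"hop_limit": 3}
while looping_packet is not None:
    looping_packet = forward(looping_packet)
```

After the loop, the packet has been discarded: no matter how the loop is wired, it cannot circulate forever.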
The IPv6 header is a mixture of variable (Type Length Value [TLV]) and fixed length information. The basic header is made up of fixed length fields, but the next header field leaves open the possibility of optional (or extension) headers, some of which are formatted as TLVs. This allows custom hardware (for instance, an Application-Specific Integrated Circuit [ASIC]) to be built to quickly switch packets based on the fixed length fields, while leaving open the possibility of carrying variable length data that might only be processed in software.
IPv6 enables multiplexing in two different ways:
• By providing a large address space to use in identifying hosts and networks (or, more broadly, reachable destinations)
• By providing a space into which the upper layer protocol can place a protocol number, which allows multiple protocols to run on top of IPv6
The IPv6 address is 128 bits, which means there can be up to 2^128 addresses—a vast number of addresses, enough to perhaps number every grain of dust on the Earth. The IPv6 address is normally written as a series of hexadecimal numbers, rather than as a series of 128 0s and 1s, as shown in Figure 5-5.
Two points about zeros are worth noting in the IPv6 address format:
• Leading zeros in each section (set off by colons) are omitted.
• A single long string of zeros can be replaced by a double colon, once (and only once) in the address.
When every address in the network begins with the same set of numbers, sometimes only the part that changes will be included, to shorten the address as well. For instance, in a network with 2001:db8:3e8:100::1 and 2001:db8:3e8:101::2, the two addresses may be referred to as 100::1 and 101::2, rather than repeating the entire address. You will need to fill in the remainder of the address from the context, such as a network diagram, or some earlier mention of the address.
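Both zero-compression rules can be checked with Python's standard ipaddress module, which prints addresses in exactly this compressed form:

```python
import ipaddress

# The module drops leading zeros in each group and collapses the single
# longest run of zero groups into "::" when printing an address.
full = "2001:0db8:03e8:0100:0000:0000:0000:0001"
addr = ipaddress.ip_address(full)

print(addr)           # 2001:db8:3e8:100::1
print(addr.exploded)  # restores the full eight-group form
```

This is also a convenient way to normalize addresses before comparing them, since the same address can be written several ways.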
Why so many addresses? Because many addresses are never used in any addressing scheme.
First, many addresses are never used because addresses are aggregated. Aggregation is the use of a single prefix (or network, or reachable destination) to represent a larger number of reachable destinations; Figure 5-6 illustrates.
In Figure 5-6:
• Hosts A and B are given 101::1 and 101::2 as their IPv6 addresses. These two hosts are, however, connected to a single broadcast segment (such as Ethernet), and hence share the same interface at C. Even though C has an address on this shared network, it actually advertises the network itself—some engineers find it helpful to think of the wire itself—as a reachable destination: 101::/64.
• E receives two reachable destinations, 101::/64 from C and 102::/64 from D. By decreasing the prefix length, it can advertise a single reachable destination that includes both of these two longer prefix reachable destinations. E advertises 100::/60.
• G, in turn, receives 100::/60 from E, and 110::/60 from F. Again, this same address space can be described using a single reachable destination, 100::/56, so this is what G advertises.
How does this aggregation work in the actual address space? Figure 5-7 is used to explain.
The prefix length, which is the number after the slash in a reachable destination, tells you the number of bits that count in determining what is part of the prefix (and hence also what is not). The prefix length is counted from left to right. Any set of addresses with the same values in the digits within the prefix length is considered part of the same reachable destination.
• There are 128 bits in the full IPv6 address space, so a /128 represents a single host.
• In an address with a 64-bit prefix length (/64), only the left four sections of the IPv6 address are part of the prefix, or the reachable destination; the remainder, the four right sections of the IPv6 address, are assumed to be either host or subnetwork addresses that are “contained” in the prefix.
• In an address with a 60-bit prefix length (/60), the left four sections of the IPv6 address minus one hexadecimal digit are considered part of the reachable destination, or the prefix.
• In an address with a 56-bit prefix length (/56), the left four sections of the IPv6 address minus two hexadecimal digits are considered part of the reachable destination, or the prefix.
Note
So long as you always change the prefix length in increments of 4 (/4, /8, /12, /16, etc.), the significant digits, or the digits that are part of the prefix, will always move one place to the right (as you increase the prefix length) or to the left (as you decrease the prefix length).
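The containment relationships from Figure 5-6 can be verified with the standard ipaddress module. The shortened prefixes are expanded here using the 2001:db8:3e8 context, following the shorthand convention described earlier:

```python
import ipaddress

net_c = ipaddress.ip_network("2001:db8:3e8:101::/64")  # advertised by C
net_d = ipaddress.ip_network("2001:db8:3e8:102::/64")  # advertised by D
net_e = ipaddress.ip_network("2001:db8:3e8:100::/60")  # E's aggregate
net_g = ipaddress.ip_network("2001:db8:3e8:100::/56")  # G's aggregate

# Decreasing the prefix length pulls the longer prefixes under one route.
print(net_c.subnet_of(net_e))  # True
print(net_d.subnet_of(net_e))  # True
print(net_e.subnet_of(net_g))  # True
```

Each shorter prefix "contains" the longer ones, which is exactly why E and G can each advertise a single route in place of several.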
Aggregation sometimes seems complicated, but it is an essential part of IP.
Second, some of the address space is consumed by autoconfiguration. While autoconfiguration is not covered in detail here, the interaction between autoconfiguration and IPv6 address assignment is important to consider. Some amount of address space must generally be set aside to ensure no two devices connected to the network will end up with the same identifier. In the case of IPv6, half of the address space (everything beyond the /64 boundary), within certain ranges of addresses, is set aside in order to form unique per-device identifiers.
Third, some addresses are set aside for special use. For instance, in IPv6, the following address spaces are assigned to some special use:
• ::ffff:0:0/96 is set aside for IPv4 addresses that are “mapped into” the IPv6 address space.
• fc00::/7 is set aside for unique local addresses (ULAs); packets with these addresses are not intended to be routed on the global Internet, but rather kept within the network of a single organization.
• fe80::/10 is set aside for link local addresses; these addresses are automatically assigned on each interface, and are only used for communicating over a single physical or virtual link.
• ::/0 is set aside as the default route; if a network device does not know of any other way to reach a particular destination, it will forward traffic toward the default route.
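Python's ipaddress module already classifies most of these reserved ranges, which makes it a convenient way to double-check them:

```python
import ipaddress

print(ipaddress.ip_address("fe80::1").is_link_local)  # True
print(ipaddress.ip_address("fc00::1").is_private)     # True: in the ULA range
# An IPv4 address "mapped into" the IPv6 space can be recovered directly:
print(ipaddress.ip_address("::ffff:192.0.2.1").ipv4_mapped)  # 192.0.2.1
# ::/0 covers the entire address space, which is why it matches anything.
print(ipaddress.ip_network("::/0").num_addresses == 2**128)  # True
```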
Fourth, devices can be assigned multiple addresses. Many engineers tend to think of an address as if it describes a single host or system. In reality, a single address can be used to describe many things, including
• A single host or system
• A single interface on a host or system; a host with multiple interfaces would have multiple addresses
• A set of reachable services on a host or system; for example, a virtual machine or a particular service running on a host may be assigned an address that is different from any of the addresses assigned to the host’s interfaces
There is no necessary direct correlation between an address and a physical device, or between an address and a physical interface.
The second multiplexing mechanism is allowing multiple protocols to run over the same base layer. This form of multiplexing is provided through protocol numbers; Figure 5-8 illustrates.
The next header field points either to
• The next header in the IPv6 packet, if there is a next header
• A protocol number, if the next header is a transport protocol (such as TCP)
These additional headers are called optional or extension headers; some of them are fixed length, and others are TLV based. For instance:
• Hop-by-hop options: A set of TLVs describing actions each forwarding device should take
• Routing: A set of fixed length route types used to indicate the path the packet should take through the network
• Fragment: A fixed length set of fields providing packet fragment information (as described above)
• Authentication header: A set of TLVs containing authentication and/or encryption information
• Jumbogram: An optional data length field enabling the IPv6 packet to carry up to one octet less than 4GB of data
The next header field is 8 bits long, which means it can carry a number between 0 and 255. Each number in this range is assigned either to a specific kind of option header or to a specific higher layer protocol. For instance:
• 0: The next header is an IPv6 hop-by-hop option.
• 1: The packet payload is the Internet Control Message Protocol (ICMP).
• 6: The packet payload is TCP.
• 17: The packet payload is the User Datagram Protocol (UDP).
• 41: The packet payload is IPv6.
• 43: The next header is an IPv6 routing header.
• 44: The next header is an IPv6 fragment header.
• 50: The next header is an Encapsulated Security Header (ESH).
The protocol number is used by the receiving host to dispatch the contents of the packet to the correct local process for processing; normally, this means stripping the lower (physical) layer headers off the packet, placing the packet into the input queue for the correct process (such as TCP), and then notifying the operating system that the relevant process needs to run.
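The dispatch step can be sketched as a simple table lookup keyed on the protocol number. The handler functions here are hypothetical stand-ins for a real stack's ICMP, TCP, and UDP input paths:

```python
# Hypothetical per-protocol input handlers; a real stack would queue the
# payload for the matching process rather than return a tuple.
def handle_icmp(payload): return ("icmp", payload)
def handle_tcp(payload):  return ("tcp", payload)
def handle_udp(payload):  return ("udp", payload)

# Protocol numbers from the list above.
DISPATCH = {1: handle_icmp, 6: handle_tcp, 17: handle_udp}

def deliver(next_header, payload):
    """Hand the payload to the process registered for this protocol number."""
    handler = DISPATCH.get(next_header)
    if handler is None:
        return None  # no process registered for this protocol number
    return handler(payload)

print(deliver(6, b"segment"))  # ('tcp', b'segment')
```

The same table-driven pattern is why an 8-bit protocol number is enough: the receiver never inspects the payload itself to decide where it goes.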
The primary goal of TCP is to provide what appears to be a connection-oriented transport on top of IP. As a higher layer protocol, it relies on the addressing and multiplexing capabilities of IPv6 to carry information to the correct destination host. Because of this, TCP does not require an address scheme. The focus of TCP is on flow and error control, considered in separate sections below. A short section on TCP port numbers rounds out this discussion of TCP.
TCP uses a sliding window method to control the flow of information across each connection between two hosts; Figure 5-9 illustrates.
In Figure 5-9, assume the initial window size is set to 20. The sequence of events is then
• At t1, the sender transmits 10 packets or octets of data (in the case of TCP, it is 10 octets of data).
• At t2, the receiver acknowledges these 10 octets, and the window is set to 30. This means the sender is now allowed to send up to 30 more octets of data before waiting for another acknowledgment; in other words, the sender can send up through octet 40 before it must wait for an acknowledgment to send more data.
• At t3, the sender sends another 5 octets of data, numbered 11–15.
• At t4, the receiver acknowledges the receipt of the octets through 15, and the window is set to 40 octets.
• At t5, the sender sends another 20 octets of data, numbered 16–35.
• At t6, the receiver acknowledges 35 and the window is set to 50.
Several important points to note about this technique are as follows:
• When the receiver acknowledges receiving a particular piece of data, it implicitly also acknowledges receiving everything before that piece of data.
• If the receiver does not send an acknowledgment—say the transmitter sends 16–35 at t5, and the receiver does not send an acknowledgment—the sender will wait some period of time, assume the data never arrived, and retransmit the data.
• If the receiver acknowledges some of the data the sender has transmitted, but not all of it, the sender assumes some of the data is missing, and retransmits from the point the receiver has acknowledged. For instance, if the sender transmitted 16–35 at t5, and the receiver acknowledged 30, the sender should retransmit from 30 forward.
• The window is set at both the sender and the receiver; this is explained in more detail in a following section.
Instead of using octet numbers, TCP assigns each transmission a sequence number; when the receiver acknowledges a specific sequence number, the transmitter assumes the receiver has actually received all the octets of information up through the transmission with that sequence number. For TCP, then, the sequence number acts as a sort of “shorthand” for a set of octets. Figure 5-10 illustrates.
In Figure 5-10:
• At t1, the sender bundles octets 1–10 and transmits them, marking them as sequence number 1.
• At t2, the receiver acknowledges sequence number 1, implicitly acknowledging the receipt of octets 1–10.
• At t3, the sender bundles octets 11–15 together and transmits them, marking them as sequence number 2.
• At t4, the receiver acknowledges sequence number 2, implicitly acknowledging the octets sent through 15.
• At t5, assume 10 octets will fit into a single packet; in this case, the sender would send two packets, one containing 16–25, with the sequence number 3, and one containing octets 26–35, with sequence number 4.
• At t6, the receiver acknowledges sequence number 4, implicitly acknowledging all the previously transmitted data.
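The cumulative behavior of these acknowledgments can be sketched as follows; this is a simplified model, not real TCP, and the segment-to-octet mapping simply mirrors the events in Figure 5-10:

```python
# Simplified model of cumulative acknowledgment: an ACK for a sequence
# number implicitly covers every octet carried by that and all earlier
# sequence numbers.
segments = {
    1: range(1, 11),   # octets 1-10
    2: range(11, 16),  # octets 11-15
    3: range(16, 26),  # octets 16-25
    4: range(26, 36),  # octets 26-35
}

def octets_acknowledged(acked_sequence):
    """Return every octet implicitly covered by acknowledging acked_sequence."""
    covered = []
    for seq in sorted(segments):
        if seq <= acked_sequence:
            covered.extend(segments[seq])
    return covered

# Acknowledging sequence 2 covers octets 1-15; sequence 4 covers 1-35.
assert octets_acknowledged(2) == list(range(1, 16))
assert octets_acknowledged(4) == list(range(1, 36))
```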
The sections that follow consider various questions in relation to the windowed flow control scheme used by TCP.
What if the first packet out of a flow of 100 packets is not received? Using the system described in Figure 5-10, the receiver would simply not acknowledge this first packet of information, forcing the sender to retransmit the data sometime later. This is inefficient, however; each dropped packet of information requires a complete resend from that packet forward. TCP implementations use two different ways to allow a single packet to be requested by a receiver.
The first way is a triple acknowledgment. If a receiver acknowledges a packet earlier than the most recently acknowledged sequence number three times, the sender assumes the receiver is asking for the packet to be retransmitted. Three repeated acknowledgments are used to prevent out-of-order packet delivery, or dropped packets, from causing a false retransmit request.
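A minimal sketch of this detection logic (the helper name is hypothetical; real implementations track duplicate acknowledgments inside the TCP state machine):

```python
# Count repeated acknowledgments for the same sequence number; three
# repeats are treated as a retransmit request, while one or two may be
# simple reordering and are ignored.
def needs_retransmit(ack_stream):
    """Return the sequence number to retransmit, or None."""
    counts = {}
    for ack in ack_stream:
        counts[ack] = counts.get(ack, 0) + 1
        if counts[ack] == 3:
            return ack
    return None

assert needs_retransmit([5, 6, 7, 7, 8]) is None   # reordering only
assert needs_retransmit([5, 6, 6, 6]) == 6         # three ACKs for 6
```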
The second way is to implement selective acknowledgments (SACK).15 SACK adds a new field to the TCP acknowledgment that allows a receiver to acknowledge the receipt of a specific set of sequence numbers, rather than assuming the acknowledgment of a single sequence number acknowledges every lower sequence number as well.
The first way in which a sender can detect a packet has been lost is through the Retransmit Time Out (RTO), which is calculated as a function of the Round Trip Time (RTT or rtt). The rtt is the time interval between the transmission of a packet by a sender and the receipt of an acknowledgment from the receiver. The rtt measures the delay through the network from the transmitter to the receiver, the processing time at the receiver, and the delay through the network from the receiver to the transmitter. Note the rtt can vary depending on the path each packet takes through the network, local conditions at the time the packet is switched, etc.
The RTO is normally calculated as a weighted average in which older rtts have less impact than more recent measured rtts.
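As a sketch, the weighted average can be computed along the lines of the classic SRTT/RTTVAR scheme; the constants used here are the commonly cited ones and are an assumption, not something the text above specifies:

```python
# Exponentially weighted average of rtt samples: older samples decay,
# recent samples dominate. The RTO is the smoothed rtt plus a margin
# derived from the measured variance.
def update_rto(srtt, rttvar, rtt_sample, alpha=1/8, beta=1/4):
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
    srtt = (1 - alpha) * srtt + alpha * rtt_sample
    return srtt, rttvar, srtt + 4 * rttvar

srtt, rttvar = 100.0, 50.0            # starting estimates, in milliseconds
for sample in (90, 110, 100, 95):     # measured rtts, most recent last
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)

assert 90 < srtt < 110    # the smoothed rtt tracks the recent samples
assert rto > srtt         # the RTO leaves headroom above the average
```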
An alternative mechanism used in most TCP implementations is fast retransmit. In fast retransmit, the receiver adds one to the expected sequence number in any acknowledgment. For instance, if a sender transmits sequence 10, the receiver acknowledges sequence 11, even though it has not yet received sequence 11. In this case, the sequence number in the acknowledgment acknowledges the receipt of data and indicates what sequence number it is expecting the sender to transmit next.
If the transmitter receives an acknowledgment with a sequence number that is one larger than the last acknowledged sequence number three times in a row, it will assume the packets following have been dropped.
There are, therefore, two types of packet loss in TCP when fast retransmit is implemented. The first is a standard timeout, which occurs when the sender transmits a packet, and does not receive an acknowledgment before the RTO expires. This is called an RTO failure. The second is called a fast retransmit failure. These two conditions are often handled differently.
There are a number of different considerations in choosing a window size, but the dominant factor is often gaining the highest possible performance while avoiding link congestion. In fact, TCP congestion control is probably the primary form of congestion control actually deployed in the global Internet. To understand TCP congestion control, it is best to begin with some definitions:
• Receive Window (RWND): The amount of data the receiver is willing to receive; this window is normally set based on the receiver’s buffer size, or some other resource available at the receiver. This is the window size advertised in the TCP header.
• Congestion Window (CWND): The amount of data the transmitter is willing to send before receiving an acknowledgment. This window is not advertised in the TCP header; the receiver does not know the size of the CWND.
• Slow Start Threshold (SST): The CWND at which the sender considers the connection at its maximum packet rate without congestion occurring on the network. The SST is initially set by the implementation, and changed in the case of packet loss depending on the congestion avoidance mechanism being used.
Most TCP implementations begin sessions with a Slow Start algorithm.16 In this phase, the CWND starts at 1, 2, or 10. For each segment for which an acknowledgment is received, the size of CWND is increased by 1. Given such acknowledgments should take not much longer than a single rtt, slow start should cause the window to double each rtt. The window will continue increasing at this rate until either a packet is lost (the receiver fails to acknowledge a packet), CWND reaches RWND, or CWND reaches SST. Once any of these three conditions occur, the sender moves to congestion avoidance mode.
Note
How does increasing CWND by 1 for each ACK received double the window each rtt? The thinking is this: When the window size is 1, you should receive one segment per rtt. When you increase the window size to 2, you should receive 2 segments in each rtt; to 4, you should receive 4, etc. As the receiver acknowledges each segment separately, and the sender increases the window by 1 each time a segment is acknowledged, the receiver should acknowledge 1 segment in the first rtt, setting the window to 2; 2 segments in the second rtt, adding 2 to the window, to set the window to 4; 4 segments in the third rtt, adding 4 to the window, to set the window size to 8, etc.
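The doubling described in this note can be sketched numerically; this is a simplification that ignores delayed ACKs and segment pacing:

```python
# One round trip per loop iteration: every in-flight segment is ACKed,
# each ACK adds 1 to CWND, so the window doubles each rtt until it
# reaches the slow start threshold (SST).
def slow_start(cwnd, sst, rounds):
    history = [cwnd]
    for _ in range(rounds):
        cwnd = min(cwnd + cwnd, sst)   # one ACK per segment sent this rtt
        history.append(cwnd)
    return history

# Starting at 1 with an SST of 64, the window doubles each rtt.
assert slow_start(1, 64, 6) == [1, 2, 4, 8, 16, 32, 64]
```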
In congestion avoidance mode, CWND is increased once each rtt, which means the size of the window stops growing exponentially and instead grows linearly. CWND will continue growing either until the receiver fails to acknowledge a packet (TCP assumes this means a packet has been lost or dropped), or until CWND reaches RWND. There are two broadly deployed ways in which a TCP implementation can respond to the loss of a packet, called Tahoe and Reno.
Note
There are actually many different variations of Tahoe and Reno; only the very basic implementations are considered here. There are also many different methods for reacting to a packet loss while the connection is in congestion avoidance mode; the “Further Reading” section contains information on where to find out about some of these other methods.
If the implementation is using Tahoe, and the packet loss is discovered through a fast retransmit, it will set SST to half of the current CWND, set CWND to its original value, and begin slow start again. This means the sender will transmit 1, 2, or 10 sequence numbers again, increasing CWND for each sequence number acknowledged. As in the beginning of the slow start process, this has the effect of doubling CWND each rtt. Once CWND reaches SST, TCP will move back into congestion avoidance mode.
If the implementation is using Reno, and the packet loss is discovered through a fast retransmit, it will set SST and CWND to half the current CWND, and continue operating in congestion avoidance mode.
In either implementation, if packet loss is discovered because the receiver does not send an acknowledgment within the RTO, the CWND is set to 1, and slow start is used to ramp the connection speed back up.
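The two reactions, plus the shared RTO case, can be summarized in a sketch. This is heavily simplified: real Tahoe and Reno variants differ in many details, and halving SST on an RTO is an assumption drawn from common practice rather than from the text above:

```python
def react_to_loss(variant, cwnd, initial_cwnd, signal):
    """Return (new_sst, new_cwnd, mode) after a loss event."""
    if signal == "rto":
        # Either variant: CWND drops to 1 and slow start ramps back up.
        # (Halving SST here is an assumption, not stated in the text.)
        return cwnd // 2, 1, "slow_start"
    if variant == "tahoe":
        # Loss seen via fast retransmit: restart slow start from the
        # original window, with SST at half the window at loss.
        return cwnd // 2, initial_cwnd, "slow_start"
    # Reno: halve both SST and CWND, stay in congestion avoidance.
    return cwnd // 2, cwnd // 2, "congestion_avoidance"

assert react_to_loss("tahoe", 40, 1, "fast_retransmit") == (20, 1, "slow_start")
assert react_to_loss("reno", 40, 1, "fast_retransmit") == (20, 20, "congestion_avoidance")
assert react_to_loss("reno", 40, 1, "rto") == (20, 1, "slow_start")
```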
TCP provides two forms of error detection and control:
• The protocol itself, along with the windowing mechanism, ensures data is delivered to the application in order and without any missing information.
• The one’s complement checksum included in the TCP header is considered weaker than a Cyclic Redundancy Check (CRC) and many other forms of error detection. This error check serves to complement, rather than replace, the error correction provided by protocols lower and higher in the stack.
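A sketch of a 16-bit one's complement checksum over an arbitrary byte string; real TCP also sums a pseudo-header and the TCP header itself, which this example omits:

```python
# Sum the data as 16-bit words with end-around carry, then complement.
def ones_complement_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                            # pad to a full word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

def verify(data: bytes, checksum: int) -> bool:
    """A receiver summing the data plus the checksum should get 0xFFFF."""
    data_sum = ~ones_complement_checksum(data) & 0xFFFF   # recover the folded sum
    return data_sum + checksum == 0xFFFF

payload = b"example"
csum = ones_complement_checksum(payload)
assert verify(payload, csum)                # intact data passes
assert not verify(b"exbmple", csum)         # a corrupted byte fails
```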
If a receiver detects a checksum error, it can use any of the mechanisms described here to request the sender retransmit the data—simply not acknowledging the receipt of the data, requesting a retransmit through SACK, actively not acknowledging the receipt of the data through fast retransmit, or by sending a triple acknowledgment for the specific segment containing the corrupted data.
TCP does not directly manage any kind of multiplexing; however, it does provide port numbers that applications and protocols above TCP in the protocol stack can use to multiplex. While these port numbers are carried in TCP, they are generally opaque to TCP; TCP does not attach any meaning to these port numbers other than using them to dispatch information to the correct application on the receiving host.
TCP port numbers are divided into two broad classes: well known and ephemeral. Well-known ports are defined as a part of an upper layer protocol specification; these ports are the “default” ports for these applications. For instance, a service supporting the Simple Mail Transfer Protocol (SMTP) can generally be found by connecting to a host using TCP on port number 25. A service supporting the Hypertext Transfer Protocol (HTTP) can normally be found by connecting to a host using TCP on port 80. These services do not necessarily need to use these port numbers; most servers can be configured to use some port number other than the one designated in the protocol specification. For instance, web servers not intended for general (or public) use may use some other TCP port, such as 8080.
Ephemeral ports are significant only to the local host and are normally assigned from a pool of available port numbers on the local host. Ephemeral ports are most often used as source ports for TCP connections; for instance, a host connecting to a service at port 80 on a server will use an ephemeral port as its source TCP port. So long as any particular host uses a given ephemeral port number only once for any TCP connection, each TCP session on any network can be uniquely identified through the source address, source port, destination address, destination port, and protocol number.
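The uniqueness guarantee can be sketched with a small session registry; the addresses and ports below are illustrative examples only:

```python
# Each session is named by its five-tuple; reusing the same ephemeral
# source port toward the same destination while a session is open
# would collide.
sessions = set()

def open_session(src, sport, dst, dport, proto="tcp"):
    key = (src, sport, dst, dport, proto)
    if key in sessions:
        raise ValueError("five-tuple already in use")
    sessions.add(key)
    return key

open_session("192.0.2.10", 49152, "198.51.100.7", 80)
open_session("192.0.2.10", 49153, "198.51.100.7", 80)  # new ephemeral port
assert len(sessions) == 2
```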
TCP uses a three-way handshake to set up a session:
1. The client sends a synchronization (SYN) to the server. This packet is a normal TCP packet, but with a SYN bit set in the TCP header, and indicates the sender is requesting a session to be set up with the receiver. This packet is normally sent to a well-known port number, or some prearranged port number that the client knows a server will be listening on at a particular IP address. This packet includes the client’s initial sequence number.
2. The server sends an acknowledgment for the SYN, a SYN-ACK. This packet acknowledges the sequence number provided by the client, plus one, and includes the server’s initial sequence number as the sequence number for this packet.
3. The client sends an acknowledgment (ACK) including the server’s initial sequence number plus one.
This process is used to ensure two-way communication exists between the client and the server before beginning to transfer data. The initial sequence numbers chosen by the sender and receiver are randomized in most implementations to prevent a third-party attacker from guessing what sequence number will be used and taking over the TCP session in its initial stages of formation.17
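The sequence-number bookkeeping in these three steps can be sketched as follows; the field names are illustrative, not wire-format accurate:

```python
import random

# Each side picks a random initial sequence number (ISN); each ACK
# carries the peer's ISN plus one, proving two-way reachability.
def three_way_handshake():
    client_isn = random.randrange(2**32)
    server_isn = random.randrange(2**32)
    syn = {"flags": "SYN", "seq": client_isn}
    syn_ack = {"flags": "SYN-ACK", "seq": server_isn, "ack": syn["seq"] + 1}
    ack = {"flags": "ACK", "ack": syn_ack["seq"] + 1}
    return syn, syn_ack, ack

syn, syn_ack, ack = three_way_handshake()
assert syn_ack["ack"] == syn["seq"] + 1      # server acknowledges client ISN
assert ack["ack"] == syn_ack["seq"] + 1      # client acknowledges server ISN
```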
In 2012, Jim Roskind designed a new transport protocol with the primary intent of increasing the speed at which data can be transferred over relatively stable high-speed networks. Specifically:
• Reducing the three-way handshake to a single packet startup (a zero-way handshake)
• Reducing the number of retransmitted packets required to transfer data
• Reducing head-of-line blocking across multiple data streams within a single TCP stream caused by packet loss
Each of these is considered in the sections that follow.
The rtt cannot, generally, be changed, because it is normally bounded by the physical distance and link speed between the sender and receiver. One of the best ways to reduce total data transfer time, then, is to simply reduce the number of round trips required between the sender and receiver to transfer a given stream or block of data. QUIC’s startup is designed to reduce the number of round trips required to set up a new connection from the three-way handshake of TCP to a 0 round trip time startup process.
To do this, QUIC uses a series of cryptographic keys and hashes (see Chapter 10, “Transport Security,” for more information); the process is
1. The client sends the server a hello (CHLO) containing a proof demand, which is a list of certificate types the client will accept to verify the server’s identity; a set of certificates the client has access to; and a hash of the certificate the client intends to use in this connection. One specific field, the source address token (STK), will be left blank, because no communication has occurred with this server before.
2. The server will use this information to create an STK based on the information provided in the client’s initial hello and the client’s source IP address. The server sends a reject (REJ), which contains this STK.
Once the client has the STK, it includes this in future hello packets. If the STK matches the previously used STK from this IP address, the server will accept the hello.
Note
This IP address/STK pair can be stolen, and hence the source IP address can be spoofed by an attacker with access to any communication with this pair included. This is a known problem in QUIC, addressed in the QUIC documentation pointed to in the “Further Reading” section at the end of the chapter.
In comparison, TCP requires at least one-and-a-half rtts to set up a new session: the SYN, the SYN-ACK, and then the following ACK. How much time does moving to a single rtt connection time save? It depends on the implementation of the client and server applications, of course. However, many web pages and mobile device apps must connect to many different servers (perhaps hundreds) to build a single web page or application screen. If each of these connections is reduced from one-and-a-half rtts to a single rtt, there could be a significant performance impact.
QUIC uses a number of different mechanisms to reduce the number of retransmitted packets:
• Including Forward Error Correction (FEC) in all packets; this allows the receiver to (often) rebuild corrupted information rather than request the information to be resent.
• Using negative acknowledgments (NACKs) rather than SACK or the triple ACK mechanism to request retransmission of specific sequence numbers; this prevents ambiguity between a request for a retransmission and network conditions that cause multiple acknowledgments to be sent.
• Using fast acknowledgments, as described previously for TCP.
• Using the CUBIC congestion avoidance window control.
The CUBIC congestion avoidance mechanism is the most interesting of these. CUBIC attempts to perform a binary search between the last window size before a packet drop and some lower window size calculated using a multiplicative factor. When a packet loss is detected (either through an RTO timeout or through a NACK), the maximum window size (WMAX) is set to the current window size, and a new minimum window size (WMIN) is calculated.
The sender’s window is set to WMIN and then quickly increased to a window size halfway between WMIN and WMAX. Once the window reaches this halfway point, the window size is increased very slowly in what is called probing, until the next packet drop is encountered. This process allows CUBIC to find the maximum transmission rate just below the point where the network begins dropping packets fairly quickly.
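The probing behavior described above can be sketched as follows. Note this follows the prose description, not the actual CUBIC algorithm, which grows the window as a cubic function of time since the last loss; the constants here are illustrative:

```python
# After a loss at WMAX, restart at WMIN and grow quickly toward the
# midpoint (halving the remaining distance each rtt), then probe slowly
# past it until the next loss reveals the new ceiling.
def window_after_loss(wmax, rounds, beta=0.7, probe_step=1):
    wmin = wmax * beta
    midpoint = (wmin + wmax) / 2
    w = wmin
    history = []
    for _ in range(rounds):
        if midpoint - w > 1:
            w += (midpoint - w) / 2    # fast, concave approach to the midpoint
        else:
            w += probe_step            # slow probing for the next loss point
        history.append(w)
    return history

h = window_after_loss(100, 8)
assert h[0] == 77.5        # quick jump from WMIN=70 toward the midpoint 85
assert h[-1] > 85          # probing continues slowly past the midpoint
```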
A “single transaction” across the Internet is often not a “single transaction,” but rather a large collection of transactions across a number of different servers. To build a single web page, for instance, hundreds of elements, such as images, scripts, Cascading Style Sheet (CSS) elements, and Hypertext Markup Language (HTML) files need to be transferred from the server to the client. There are two ways these files can be transferred: in serial or in parallel. Figure 5-11 illustrates.
In Figure 5-11, three options are illustrated to transfer multiple elements from a server to a client:
• In the serialized option, the elements are transferred one at a time across a single session. This is the slowest of the three possible options, as the entire page must be built element by element, with smaller elements waiting on larger ones to transfer before they can be displayed.
• In the multiple streams option, each element is transferred over a separate connection (such as a TCP session). This is much faster, but it requires multiple connections to be built, which can negatively impact the client and server resources.
• In the multiplexed option, each element is transferred separately across a single connection. This allows each element to be transferred at its own rate, but without the resource overhead of the multiple streams option.
Some form of multiplexed transfer mechanism tends to provide the fastest transfer rate with the most efficient use of resources, but how should this multiplexing be implemented? The Hypertext Transfer Protocol version 2 (HTTPv2) allows a web server to multiplex multiple elements across a single HTTP session; since HTTP runs on top of TCP, this means a single TCP stream can be used to transfer multiple web page elements in parallel. However, a single dropped packet at the TCP level means every parallel transfer within the HTTP stream must be paused while TCP recovers (this is a form of fate sharing).
QUIC solves this problem by allowing multiple HTTPv2 streams to reside within a single QUIC connection. This reduces the transport overhead at the client and server, while providing optimal delivery of the web page elements.
While the transport protocols, such as TCP and QUIC, tend to receive the most attention among the middle tier of protocols, there are a number of other protocols that are just as important for the operation of an IP-based network. Among these is ICMP, which can be said to provide metadata about the network itself. ICMP is a simple protocol that is used to request specific state information, or for network devices to send information about why a particular packet is being dropped at some point in the network. Specifically:
• ICMP can be used to send an echo request or echo reply. This functionality is used to ping a particular destination address, which can be used to determine if the address is reachable without consuming too many resources at the receiver.
• ICMP can be used to send a notification about a packet being dropped because it is too large to be transmitted across a link (the packet is too big).
• ICMP can be used to send a notification that a packet has been dropped because its Time to Live (TTL) has reached 0 (the packet has expired in transit).
The packet too big response can be used to find the Maximum Transmission Unit (MTU) across a network; the sender can transmit a large packet and wait to see if some device in the network sends a packet too big notification through ICMP. If such a notification arrives, the sender can try progressively smaller packets to determine the largest packet that can be transmitted end-to-end across the network.
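This discovery process can be simulated. The sketch below uses a binary search over probe sizes rather than the strictly decreasing probes described above, and the hidden path MTU of 1400 is just an example value:

```python
PATH_MTU = 1400   # smallest link MTU on the path, unknown to the sender

def send(size):
    """True if delivered; False models an ICMP packet too big notification."""
    return size <= PATH_MTU

def discover_mtu(low, high):
    """Find the largest packet size that crosses the path end to end."""
    while low < high:
        probe = (low + high + 1) // 2
        if send(probe):
            low = probe        # probe fit: the path MTU is at least this big
        else:
            high = probe - 1   # "packet too big" came back: shrink the probe
    return low

assert discover_mtu(576, 9000) == 1400
```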
The expired in transit response can be used to trace the route from a source to a destination in a network (this is called trace route). A sender can transmit a packet to a particular destination using any transport layer protocol (including TCP, UDP, or QUIC), but with a TTL of 1. The first hop network device should decrement the TTL and send an ICMP expired in transit notification back to the sender. By sending a series of packets, each with a TTL one larger than the previous one, the sender can induce each device along the path to transmit an ICMP expired in transit notification, revealing the entire path of the packet.
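The trace route procedure can be sketched against a simulated path; the hop names below are made up for illustration:

```python
# Each probe's TTL expires one hop further along the path; the identity
# of the responding hop is collected until the destination itself answers.
path = ["r1", "r2", "r3", "server"]

def probe(ttl):
    """Return the hop where the TTL reaches 0, or the final destination."""
    return path[min(ttl, len(path)) - 1]

def trace_route():
    hops, ttl = [], 1
    while True:
        hop = probe(ttl)
        hops.append(hop)
        if hop == path[-1]:          # destination reached: trace complete
            return hops
        ttl += 1

assert trace_route() == ["r1", "r2", "r3", "server"]
```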
Upper layer transport protocols manage the same problems as lower layer transport protocols—error control, flow control, transport, and marshaling—only end to end rather than device to device. Even so, while many of the solutions are similar or the same, many other solutions are radically different. This chapter has considered four different upper layer transport protocols, two of which occupy the same “space” in a protocol stack—TCP and QUIC—and two of which occupy completely different spaces in a protocol stack—IP and ICMP. While there are other solutions to the problems presented, the solutions presented by these four protocols cover most of the widely deployed solutions to these problems.
The next chapter moves you from understanding how information is transported across a network into the realm of how the layers considered in this and the previous chapter interact. These interlayer problems relate to the interaction surfaces considered in the State/Optimization/Surface model of network complexity you will find useful in analyzing a large number of network problems.
Armitage, Grenville. “Summary of Five New TCP Congestion Control Algorithms Project.” The FreeBSD Forums. Accessed July 5, 2017. https://forums.FreeBSD.org/threads/22396/.
Blanton, Ethan, Dr. Vern Paxson, and Mark Allman. TCP Congestion Control. Request for Comments 5681. RFC Editor, 2009. doi:10.17487/RFC5681.
Chu, H. K. Jerry, and Vivek Kashyap. Transmission of IP over InfiniBand (IPoIB). Request for Comments 4391. RFC Editor, 2006. doi:10.17487/RFC4391.
Cole, Robert G., Dr. David H. Shur, and Curtis Villamizar. IP over ATM: A Framework Document. Request for Comments 1932. RFC Editor, 1996. doi:10.17487/RFC1932.
Deering, Dr. Steve E., and Robert M. Hinden. “Internet Protocol, Version 6 (IPv6) Specification.” Internet-Draft. Internet Engineering Task Force, May 2017. https://datatracker.ietf.org/doc/html/draft-ietf-6man-rfc2460bis-13.
Desanti, Claudio, Robert Nixon, and Craig Carlson. Transmission of IPv6, IPv4, and Address Resolution Protocol (ARP) Packets over Fibre Channel. Request for Comments 4338. RFC Editor, 2006. doi:10.17487/RFC4338.
“DoD Standard Internet Protocol.” IETF, January 1980. https://tools.ietf.org/html/rfc760.
Fairhurst, Gorry, Marie-Jose Montpetit, Bernhard Collini-Nocker, Hilmar Linder, and Horst D. Clausen. A Framework for Transmission of IP Datagrams over MPEG-2 Networks. Request for Comments 4259. RFC Editor, 2005. doi:10.17487/RFC4259.
Floyd, Sally, Jamshid Mahdavi, Matt Mathis, and Dr. Allyn Romanow. TCP Selective Acknowledgment Options. Request for Comments 2018. RFC Editor, 1996. doi:10.17487/RFC2018.
Gont, Fernando, and Steven Bellovin. Defending against Sequence Number Attacks. Request for Comments 6528. RFC Editor, 2012. doi:10.17487/RFC6528.
Gupta, Mukesh, and Alex Conta. Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification. Request for Comments 4443. RFC Editor, 2006. doi:10.17487/RFC4443.
Ha, Sangtae, Injong Rhee, and Lisong Xu. “CUBIC: A New TCP-Friendly High-Speed TCP Variant.” ACM SIGOPS Operating System Review 42, no. 5 (July 2008): 64–74.
Huston, Geoff. “Dealing with IPv6 Fragmentation in the DNS.” APNIC Blog, August 22, 2017. https://blog.apnic.net/2017/08/22/dealing-ipv6-fragmentation-dns/.
Internet Control Message Protocol. Request for Comments 792. RFC Editor, 1981. doi:10.17487/RFC0792.
“互联网协议。” IETF,1981 年 9 月。https ://tools.ietf.org/html/rfc791。
“Internet Protocol.” IETF, September 1981. https://tools.ietf.org/html/rfc791.
IPv4 耗尽,2010 年 6 月 9 日。https ://commons.wikimedia.org/wiki/File: Ipv4-exhaust.svg 。
IPv4 Exhaustion, June 9, 2010. https://commons.wikimedia.org/wiki/File:Ipv4-exhaust.svg.
Jacobson, V.“拥塞避免和控制”。摘自《通信架构和协议研讨会论文集》,314-29。SIGCOMM '88。美国纽约州纽约:ACM,1988。doi:10.1145/52324.52356。
Jacobson, V. “Congestion Avoidance and Control.” In Symposium Proceedings on Communications Architectures and Protocols, 314–29. SIGCOMM ’88. New York, NY, USA: ACM, 1988. doi:10.1145/52324.52356.
贾迈勒、哈比布拉和基兰苏丹。“TCP 拥塞控制算法的性能分析。” 国际计算机与通信杂志2,no。1(2008):30-38。
Jamal, Habibullah, and Kiran Sultan. “Performance Analysis of TCP Congestion Control Algorithms.” International Journal of Computers and Communications 2, no. 1 (2008): 30–38.
Johansson,Peter G。基于 IEEE 1394 的 IPv4。征求意见 2734。RFC 编辑,1999。doi:10.17487/RFC2734。
Johansson, Peter G. IPv4 over IEEE 1394. Request for Comments 2734. RFC Editor, 1999. doi:10.17487/RFC2734.
卡茨、戴夫. 通过 FDDI 网络传输 IP 和 ARP。征求意见 1390。RFC 编辑,1993。doi:10.17487/RFC1390。
Katz, Dave. Transmission of IP and ARP over FDDI Networks. Request for Comments 1390. RFC Editor, 1993. doi:10.17487/RFC1390.
劳伦斯 (Joe L.) 和大卫 M.皮西泰洛 (David M. Piscitello)。通过 SMDS 服务传输 IP 数据报。征求意见 1209。RFC 编辑,1991。doi:10.17487/RFC1209。
Lawrence, Joe L., and David M. Piscitello. The Transmission of IP Datagrams over the SMDS Service. Request for Comments 1209. RFC Editor, 1991. doi:10.17487/RFC1209.
Luciani、James V. 博士、Bala Rajagopalan 博士和 Daniel O. Awduche。光网络上的IP:一个框架。征求意见 3717。RFC 编辑,2004 年。doi:10.17487/RFC3717。
Luciani, Dr. James V., Dr. Bala Rajagopalan, and Daniel O. Awduche. IP over Optical Networks: A Framework. Request for Comments 3717. RFC Editor, 2004. doi:10.17487/RFC3717.
马蒂斯、马特、南迪塔·杜基帕蒂和郑玉中。TCP 按比例降低费率。征求意见 6937。RFC 编辑,2013。doi:10.17487/RFC6937。
Mathis, Matt, Nandita Dukkipati, and Yuchung Cheng. Proportional Rate Reduction for TCP. Request for Comments 6937. RFC Editor, 2013. doi:10.17487/RFC6937.
帕纳贝克、拉斯顿、西蒙·韦格里夫和丹·齐格蒙德。在电视信号的垂直消隐间隔内进行 IP 传输。征求意见 2728。RFC 编辑,1999。doi:10.17487/RFC2728。
Panabaker, Ruston, Simon Wegerif, and Dan Zigmond. The Transmission of IP Over the Vertical Blanking Interval of a Television Signal. Request for Comments 2728. RFC Editor, 1999. doi:10.17487/RFC2728.
帕特里奇、克雷格博士、马克·奥尔曼和莎莉·弗洛伊德。增加 TCP 的初始窗口。征求意见 3390。RFC 编辑,2002。doi:10.17487/RFC3390。
Partridge, Dr. Craig, Mark Allman, and Sally Floyd. Increasing TCP’s Initial Window. Request for Comments 3390. RFC Editor, 2002. doi:10.17487/RFC3390.
Postel, J.“对互联网协议和 TCP 的评论”,1977 年 8 月 15 日。https: //www.rfc-editor.org/ien/ien2.txt。
Postel, J. “Comments on Internet Protocol and TCP,” August 15, 1977. https://www.rfc-editor.org/ien/ien2.txt.
———。“互联网协议规范草案,第 2 版”,1978 年 2 月。https ://www.rfc-editor.org/ien/ien28.pdf。
———. “Draft Internetwork Protocol Specification, Version 2,” February 1978. https://www.rfc-editor.org/ien/ien28.pdf.
———。“互联网协议规范,第 4 版”,1978 年 6 月。https ://www.rfc-editor.org/ien/ien41.pdf。
———. “Internetwork Protocol Specification, Version 4,” June 1978. https://www.rfc-editor.org/ien/ien41.pdf.
———。“互联网协议规范,第 4 版”,1978 年 9 月。https: //www.rfc-editor.org/ien/ien41.pdf。
———. “Internetwork Protocol Specification, Version 4,” September 1978. https://www.rfc-editor.org/ien/ien41.pdf.
“QUIC,基于 UDP 的多路复用流传输 — Chromium 项目。” 访问日期:2017 年 7 月 5 日。https ://www.chromium.org/quic。
“QUIC, a Multiplexed Stream Transport over UDP—The Chromium Projects.” Accessed July 5, 2017. https://www.chromium.org/quic.
里格尔、麦克斯、郑尚金和全洪锡。通过 IEEE 802.16 网络进行以太网 IP 传输。征求意见 5692。RFC 编辑,2009。doi:10.17487/RFC5692。
Riegel, Max, Sangjin Jeong, and HongSeok Jeon. Transmission of IP over Ethernet over IEEE 802.16 Networks. Request for Comments 5692. RFC Editor, 2009. doi:10.17487/RFC5692.
史蒂文斯,W.理查德。TCP 慢启动、拥塞避免、快速重传和快速恢复算法。征求意见 2001 年。RFC 编辑,1997 年。doi:10.17487/RFC2001。
Stevens, W. Richard. TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms. Request for Comments 2001. RFC Editor, 1997. doi:10.17487/RFC2001.
Varada,Srihari V。基于 PPP 的 IP 版本 6。征求意见 5072。RFC 编辑,2007。doi:10.17487/RFC5072。
Varada, Srihari V. IP Version 6 over PPP. Request for Comments 5072. RFC Editor, 2007. doi:10.17487/RFC5072.
维茨曼、大卫. 鸟类载体上的 IP 数据报传输标准。征求意见 1149。RFC 编辑,1990。doi:10.17487/RFC1149。
Waitzman, David. Standard for the Transmission of IP Datagrams on Avian Carriers. Request for Comments 1149. RFC Editor, 1990. doi:10.17487/RFC1149.
1. The choice of using a /64 for the host address space is often considered controversial. What do you think are the positive and negative aspects of this specific choice?
2. The Internet has been “running out of IPv4 address space” for many years. One of the reactions to this lack of address space has been the widespread deployment of Network Address Translators (NATs). The following questions relate to NATs.
a. What is the difference between a NAT and a Port Address Translator (PAT)?
b. Why do PATs create a problem for FTP (for instance)? How is this normally solved?
c. Throughout the development of IPv6, there was a general movement against the deployment of NATs and PATs; can you think of or find the reasons engineers objected to the use of NAT and PAT on the global Internet?
3. The fragmentation of packets by routers and other network devices was removed from the IPv6 specification, although it was allowed in the IPv4 specifications. What are the tradeoffs in removing this capability? What complexity does fragmentation add to network devices, and what complexity does removing it from the network devices add to end hosts?
4. How can an implementation of TCP differentiate between well-known and ephemeral ports? Does it need to?
5. From a security perspective, what might be the advantages and disadvantages of allowing network devices and hosts to respond to ICMP, versus not allowing them to?
6. Explain the aggregation of IPv6 addresses in terms of nibbles (4 bits) rather than in bits, as is done in the chapter.
7. Why is the prefix length called the prefix length? What is the history behind this term?
8. Compare the prefix length to the subnet mask, the older method for determining where the network address stops and the host bits (the “bits the network device does not care about”) begin. Which do you think is easier to use?
1. Postel, “Comments on Internet Protocol and TCP,” 1.
2. Postel, “Draft Internetwork Protocol Specification, Version 2.”
3. Postel, “Internetwork Protocol Specification, Version 4,” June 1978.
4. Postel, “Internetwork Protocol Specification, Version 4,” September 1978.
5. “DoD Standard Internet Protocol.”
6. “Internet Protocol.”
7. Mro, IPv4 Exhaustion.
8. Fairhurst et al., A Framework for Transmission of IP Datagrams over MPEG-2 Networks.
9. Cole, Shur, and Villamizar, IP over ATM: A Framework Document.
10. Luciani, Rajagopalan, and Awduche, IP over Optical Networks: A Framework.
11. Varada, IP Version 6 over PPP.
12. Panabaker, Wegerif, and Zigmond, The Transmission of IP Over the Vertical Blanking Interval of a Television Signal.
13. Katz, Transmission of IP and ARP over FDDI Networks.
14. Waitzman, Standard for the Transmission of IP Datagrams on Avian Carriers.
15. Floyd et al., TCP Selective Acknowledgment Options.
16. Blanton, Paxson, and Allman, TCP Congestion Control.
17. Gont and Bellovin, Defending against Sequence Number Attacks.
18. Huston, “Dealing with IPv6 Fragmentation in the DNS.”
In a layered and/or modularized system, there must be some way to relate services or entities in one layer to services and entities in another. Figure 6-1 illustrates the problem.
In Figure 6-1:
• How can A, D, and E discover the IP address they should be using for their interfaces?
• How can D discover the Media Access Control (MAC), physical, or lower layer protocol address it should use to send packets to E?
• How can client1.example, which is running on D, discover the Internet Protocol (IP) address it should use to reach www.service1.example?
• How can D and E discover what address they should send traffic to if it is not on the same wire or segment?
Each of these problems represents a different part of interlayer discovery. While these problems may seem unrelated, they actually represent the same set of problems, with a narrow set of available solutions, at different layers of a network or protocol stack. This chapter will consider a range of possible solutions for these problems, including examples of each solution.
This chapter will end with a section on the default gateway problem; while this is not strictly an interlayer discovery problem, it is still important to understanding how an IP network operates.
The main reason the interlayer discovery problem space appears to be a large set of unrelated problems, rather than a single problem, is that it is spread across many different layers; each set of layers in a network protocol stack needs to be able to discover which service or entity at “this” layer relates to which service or entity at some lower layer. Another way to describe this set of problems is the ability to map an identifier at one layer to an identifier at another layer—identifier mapping. As there are at least three pairs of protocols in most widely deployed protocol stacks (and potentially, or arguably, eight), a wide variety of solutions must be deployed to solve the same set of interlayer discovery problems in different places. Two definitions will be helpful in understanding the range of solutions, and actual deployed protocols and systems in this space:
• An identifier is a set of numbers or letters (such as a string) that uniquely identify an entity.
• A device, whether real or virtual, which appears to be a single destination from the point of view of the network will be called an entity when considering generic problems and solutions, and hosts or services when considering specific solutions.
There are four different ways to solve the interlayer discovery and address assignment problems:
• Using well-known and/or manually configured identifiers
• Storing the information in a mapping database that services can access to map between different kinds of identifiers
• Advertising a mapping between two identifiers in a protocol
• Calculating one kind of identifier from another
These solutions not only apply to discovery, but also identifier assignment. When a host is connected to a network, or a service is spun up, it must somehow determine how it should identify itself—for instance, what Internet Protocol version 6 (IPv6) address it should use when connecting to the local network. The solutions available for solving this problem are the same four solutions.
These four solutions will be considered in the following sections.
The solution chosen often depends on the scope of the identifiers, the sheer number of identifiers that need to be assigned, and the rate at which the identifiers change. If
• The identifiers are widely used, especially in protocol implementations, and the network will simply not work without some agreement on the interlayer mappings, and…
• The number of mappings between identifiers is relatively small, and…
• The identifiers are generally stable—in particular, they are never changed in a way that requires existing, deployed implementations to be modified in order to allow the network to continue functioning, then…
The easiest solution is to manually maintain a mapping table of some kind.
For instance, the Transmission Control Protocol (TCP) carries a number of higher layer transport protocols. The problem of relating individual carried protocols to port numbers is a global interlayer discovery problem: every implementation of TCP deployed in a real network must be able to agree on what services are reachable on specific port numbers for the network to “work.” The range of interlayer mappings, however, is very small (a few thousand port numbers need to be mapped to services) and fairly static (new protocols or services are not often added). This specific problem, then, is easy to solve through a manually managed mapping table.
The mapping table for TCP port numbers is maintained by the Internet Assigned Numbers Authority (IANA), at the direction of the Internet Engineering Task Force (IETF); a part of this table is shown in Figure 6-2.1
In Figure 6-2, the echo service is assigned port 7; this service is used to provide the ping functionality described at the end of Chapter 5, “Higher Layer Data Transports.”
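The idea of a small, manually maintained, well-known mapping table can be sketched directly. The table entries below are a few real IANA assignments (including the echo service on port 7); the lookup helpers are illustrative, not part of any standard API:

```python
# A minimal sketch of a manually maintained identifier-mapping table,
# modeled on the IANA service-name-to-port registry. Because the table
# is well known and static, every implementation that ships a copy of
# it agrees on the mappings without any runtime discovery protocol.
WELL_KNOWN_PORTS = {
    "echo": 7,   # the ping-style echo service described in Chapter 5
    "ftp": 21,
    "ssh": 22,
    "dns": 53,
    "http": 80,
}

def port_for_service(name):
    # Forward lookup: which port is this service reachable on?
    return WELL_KNOWN_PORTS[name]

def service_for_port(port):
    # Reverse lookup: which service is assigned this port number?
    return next((s for s, p in WELL_KNOWN_PORTS.items() if p == port), None)
```

The same pattern, at a much larger and more dynamic scale, is what motivates the database-backed solutions described next.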
If the number of entries in the table becomes large enough, the number of people involved in maintaining the table becomes large enough, or the information is dynamic enough that it needs to be learned at the time the mapping is required, rather than when a piece of software is deployed, it makes sense to build and distribute a database dynamically. Such a system should include protocols to synchronize database partitions to present a consistent view to external queries, and protocols hosts and services can use to query the database with one identifier to discover the matching identifier from a different layer of the network.
Dynamic mapping databases may accept input through manual configuration or automated processes (such as a discovery process that gathers information about the state of the network and stores the resulting information in the dynamic database). They may also either be distributed, which means copies or portions of the database are stored on a number of different hosts or servers, or centralized, which means the database is stored on a small number of hosts or servers.
The Domain Name System (DNS) is described as an example of an identity mapping service based on a dynamic, distributed database. The Dynamic Host Configuration Protocol (DHCP) is described as an example of a similar system used primarily for the assignment of addresses.
If the scope of the mapping problem can be contained, but the number of identity pairs is large, or can change rapidly, then creating a single protocol that allows entities to request mapping information from a device directly can be an optimal solution. For instance, in Figure 6-1, D could ask E directly what its local MAC (or physical) address is.
The Internet Protocol version 4 (IPv4) Address Resolution Protocol (ARP) is a good example of this kind of solution, as is the IPv6 Neighbor Discovery (ND) protocol. These examples are considered in more detail in later sections.
In some cases, it is possible to calculate an address or identifier at one layer from the address or identifier in another layer. Few systems use this technique for mapping addresses; most systems that use this technique do so in order to assign an address. One example of this type of system is Stateless Address Autoconfiguration (SLAAC), an IPv6 protocol hosts can use to determine what IPv6 address should be assigned to an interface, which is considered in more detail as part of the IPv6 ND discussion later in the chapter.
Another example of using a lower layer address to calculate an upper layer address is in the formation of end-system addresses in the International Organization for Standardization (ISO) suite of protocols, such as Intermediate System to Intermediate System (IS-IS). This example is considered in more detail in Chapter 16, “Link State and Path Vector Control Planes.”
Four examples of protocols providing interlayer discovery and address assignment are considered in the following sections.
DNS maps human-readable character strings, such as the name service1.example used in Figure 6-1, to IP addresses. Figure 6-3 illustrates the basic operation of the DNS system.
In Figure 6-3, assuming there are no caches of any kind (so the entire process is illustrated):
1. A host, A, attempts to connect to www.service1.example. The host’s operating system examines its local configuration for the address of the DNS server it should query to discover where this service is located, and finds the address of the recursive server. The host operating system’s DNS application sends a DNS query to this address.
2. The recursive server receives this query and—given there are no caches—examines the domain name for which an address is being requested. The recursive server notes the right-hand portion of the domain name is example, so it asks a root server where to find information on the example domain.
3. The root server returns the address of the server containing information about the top-level domain (TLD) example.
4. The recursive server now requests information about which server to contact about service1.example. The recursive server proceeds through the domain name one section at a time, using information discovered about the section of the name to the right to discover which server to ask about the information to the left. This process is called recursing through the domain name; hence the server is called a recursive server.
5. The TLD server returns the address of the authoritative server for service1.example. If information about the location of a service has been cached from a prior request, it is returned as a nonauthoritative answer; if the actual server configured to hold the information about a domain replies, its answer is authoritative.
6. The recursive server requests information about www.service1.example from the authoritative server.
7. The authoritative server responds with the IP address of server B.
8. The recursive server now responds to the host, A, with the correct information to reach the requested service.
9. The host, A, contacts the server on which www.service1.example is running on the IP address 2001:db8:3e8:100::1.
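The label-by-label walk the recursive server performs can be modeled with a toy zone tree. The nested-dict layout, the zone contents, and the `resolve` helper are all illustrative assumptions, not the real root/TLD infrastructure:

```python
# Toy model of the resolution walk in Figure 6-3. Each "server" is a
# dict mapping the next label either to another server (a referral) or
# to a final record. Names and addresses follow the chapter's example.
ROOT = {
    "example": {                           # referral to the TLD server
        "service1": {                      # referral to the authoritative server
            "www": "2001:db8:3e8:100::1",  # the record itself
        },
    },
}

def resolve(name, root=ROOT):
    # A recursive server works through the name one label at a time,
    # right to left, following each referral until it reaches a record.
    server = root
    for label in reversed(name.split(".")):
        server = server[label]
    return server
```

Splitting the data this way is what lets each zone owner maintain only their own piece of the tree, as the next paragraph explains.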
This process may appear to be very drawn out; for instance, why not just keep all the information on the root server to save a lot of steps? This would violate the basic idea of DNS, however, which is to keep information about each domain in the control of the domain owner as much as possible. Further, this would make the building and maintenance of the root servers very expensive, as they would need to be capable of holding millions of records and answer hundreds of millions of queries for DNS information each day. The separation of information allows each owner to control his data and enables the DNS system to scale.
Normally, the information returned through a DNS query process is cached by each server along the way, so the mapping does not need to be requested each time the host needs to reach a new server.
How are these DNS tables maintained? Usually through the manual work of domain- and top-level domain owners, as well as edge providers all across the world. DNS does not automatically discover the name of each entity attached to the network and what each one’s address is.
DNS pairs a manually maintained database, with the work spread out among many different pairs of hands, with a protocol used to query the database; hence DNS falls into the mapping database with a protocol class of solutions. How does a host know what DNS server to query? This information is either manually configured or learned through a discovery protocol such as IPv6 ND or DHCP.
When a host (or some other device) first connects to a network, how does it know which IPv6 address (or set of IPv6 addresses) to assign to the local interface? One solution to this problem is for the host to send a query to some database to discover what addresses it should use, such as DHCPv6. To understand DHCPv6, it is important to begin with the concept of a link local address in IPv6. In the discussion on the size of the IPv6 address space in Chapter 5, “Higher Layer Data Transports,” fe80::/10 was called out as being reserved for link local addressing. To form a link local address, a device running IPv6 combines the fe80:: prefix with the MAC (or physical) address, which is often formatted as an EUI-48 address, and sometimes as an EUI-64 address (see Chapter 4, “Lower Layer Transports,” for information on EUI addresses). For instance:
• A device has an interface with the EUI-48 address 01-23-45-67-89-ab.
• This interface is connected to an IPv6 network.
• The device can assign fe80::123:4567:89ab as a link local address and use this address to communicate to other devices on this segment only.
This is an example of calculating one identifier from another. Once the link local address has been formed, DHCPv6 is one method that can be used to obtain a unique address within the network (or globally, depending on the configuration of the network). DHCPv6 uses the User Datagram Protocol (UDP) for its lower layer transport. Figure 6-4 illustrates.
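The example above simply appends the MAC address to the fe80:: prefix for readability. The standard modified EUI-64 procedure defined in RFC 4291 does slightly more work: it flips the universal/local bit in the first octet and inserts ff:fe between the two halves of the MAC address. A sketch of that standard calculation (the function name is illustrative):

```python
def eui48_to_link_local(mac):
    # Parse the EUI-48 (e.g. "01-23-45-67-89-ab") into its six octets.
    octets = bytes(int(part, 16) for part in mac.split("-"))
    # Modified EUI-64 (RFC 4291): flip the universal/local bit of the
    # first octet and insert ff:fe between the two halves of the MAC.
    eui64 = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]
    # Prepend the link local prefix fe80:: and group into 16-bit words.
    words = [f"{(eui64[i] << 8) | eui64[i + 1]:x}" for i in range(0, 8, 2)]
    return "fe80::" + ":".join(words)

# For 01-23-45-67-89-ab this yields fe80::323:45ff:fe67:89ab.
```

Either way, the key point is the same: one identifier (the link local address) is calculated from another (the interface's MAC address) with no server involved.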
In Figure 6-4:
1. The host that has just connected to the network, A, sends a solicit message. This message is sourced from the link local address and sent to the multicast address ff02::1:2, UDP ports 547 (for the server) and 546 (for the client), so every device connected to the same physical wire will receive the message. This message will include a DHCP Unique Identifier (DUID), which the client forms,2 and the server uses to ensure it is consistently communicating with the same device.
2. B and C, both of which are configured to act as DHCPv6 servers, respond with an advertise message. This message is a unicast packet directed at A itself, using the link local address from which A sources the solicit message.
3. Host A chooses one of the two servers from which to request an address. The host sends a request to the multicast address ff02::1:2, asking B to provide it with an address (or a pool of addresses), information on which DNS server to use, etc.
4. The server, running on B, then responds with a reply to the link local address A initially formed; this verifies B has allocated the resources from its local pool and allows A to start using them.
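The four-message exchange above (solicit, advertise, request, reply) can be sketched as a small simulation. The class and function names, the address pools, and the DUID string are all illustrative; a real exchange runs over UDP ports 546/547 with the solicit and request sent to ff02::1:2:

```python
# Minimal sketch of the stateful DHCPv6 exchange in Figure 6-4.
class Dhcp6Server:
    def __init__(self, name, pool):
        self.name = name
        self.pool = list(pool)     # addresses this server can lease
        self.leases = {}           # DUID -> committed address

    def advertise(self, duid):
        # Step 2: offer an address without committing any resources.
        return self.pool[0]

    def reply(self, duid):
        # Step 4: commit the address from the local pool to this client.
        addr = self.pool.pop(0)
        self.leases[duid] = addr
        return addr

def stateful_exchange(client_duid, servers):
    # Step 1: the solicit is multicast, so every server hears it and
    # answers with an advertise.
    offers = [(s, s.advertise(client_duid)) for s in servers]
    # Step 3: the client picks one server and sends it a request.
    chosen, _ = offers[0]
    # Step 4: only the chosen server allocates the lease.
    return chosen.name, chosen.reply(client_duid)
```

The separation between advertise (no state committed) and reply (state committed) is what makes this mode "stateful": the server must remember which addresses it has handed out.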
What happens if no device on the segment is configured as a DHCPv6 server? For instance, in Figure 6-4, what if D is the only available DHCPv6 server because DHCPv6 is not running on B or C? In this case, a router (or even some other host or device) can act as a DHCPv6 relay. The DHCPv6 packets that A transmits will be received by the relay, encapsulated, and transmitted to the DHCPv6 server for processing.
Note
The process described here is called stateful DHCP and is normally triggered when the Managed bit is set in the router advertisement. DHCPv6 can also work with SLAAC, described later in the “IPv6 Neighbor Discovery” section, to provide information SLAAC does not provide in the stateless DHCPv6 mode. This mode is normally used when the Other bit is set in the router advertisement. The IETF draft DHCPv6/SLAAC Interaction Problems on Address and DNS Configuration describes this interaction and problems in the interaction between these two mechanisms.3
In cases where the network administrator knows all IPv6 addresses will be configured through DHCPv6, and only one DHCPv6 server will be available on each segment, the advertise and request messages can be skipped by enabling DHCPv6 rapid commit.
Although IPv6 is the focus of this book, there are some instances where IPv4 provides a useful example of a solution; the IPv4 Address Resolution Protocol (ARP) is one such case. ARP is a very simple protocol used to solve interlayer discovery without relying on a server of any type. Figure 6-5 will be used to explain the operation of ARP.
Assume A would like to send a packet to C. Knowing C’s IPv4 address, 203.0.113.12, is not enough for A to properly form a packet to place on the wire toward C. To properly build a packet, A must also know
• Whether or not C is on the same wire as A
• The MAC, or physical, address of C
Without these two pieces of information, A does not know how to encapsulate the packet on the wire so that C will actually receive the packet and B will ignore it. How can A discover this information? The first question, whether or not C is on the same wire as A, can be answered by considering the local interface IP address, the destination IP address, and the subnet mask. This is considered in more detail later in this chapter.
ARP solves the second problem, matching the destination IP address to the destination MAC address, with the following process:
1. Host A sends a broadcast packet to every device on the wire containing the IPv4 address, but not the MAC address. This is an ARP request; it is A’s request for the MAC address corresponding to 203.0.113.12.
2. B and D receive this packet, but do not respond, because none of their local interfaces have the address 203.0.113.12.
3. Host C receives this packet and responds to the request using a unicast packet. This ARP reply contains both the IPv4 address and the matching MAC address, giving A the information needed to build packets toward C.
当 A 收到此回复时,它会将 203.0.113.12 与回复中包含的 MAC 地址之间的映射插入到本地 ARP 缓存中。该信息将被保存直至超时;ARP 缓存条目的超时规则因实现而异,并且通常可以手动配置。缓存 ARP 条目的时间长度是在网络上不经常重复相同信息(在 IPv4 到 MAC 地址映射不经常更改的情况下)与跟上 ARP 条目位置的任何变化之间的平衡。设备,在特定 IPv4 地址可能在主机之间移动的情况下。
When A receives this reply, it will insert the mapping between 203.0.113.12 and the MAC address contained in the reply in a local ARP cache. This information will be stored until it times out; the rules for timing out an ARP cache entry vary between implementations and can often be manually configured. How long to cache an ARP entry is a balance between not repeating the same information too often on the network, in the case where the IPv4-to-MAC address mapping does not change very often, and keeping up with any changes in the location of a device, in the case where a particular IPv4 address may move between hosts.
Any device receiving an ARP reply can accept the packet and cache the information it contains. For instance, B, on receiving the ARP reply from C, can insert the mapping between 203.0.113.12 and C’s MAC address into its ARP cache. In fact, this property of ARP is often used to speed up the discovery of devices when they are attached to a network. There is nothing in the ARP specification that requires a host to wait for an ARP request to send an ARP reply. When a device connects to a network, it can simply send an ARP reply with the correct mapping information to make the initial connection process to other hosts on the same wire faster; this is called a gratuitous ARP.
Gratuitous ARPs are also useful for Duplicate Address Detection (DAD); if a host receives an ARP reply with an IPv4 address it is using, it will report a duplicate IPv4 address. Some implementations will also send out a series of gratuitous ARPs in this case, in order to prevent the address from being used, or force the other host to also report the duplicate address.
What happens if Host A requests an address using ARP that is not on the same segment, such as 198.51.100.101 in Figure 6-5? There are two different possibilities to this situation:
• If D is configured to answer as a proxy ARP, it can respond to the ARP request with the MAC address of its own interface connected to the segment. A will then cache this response, sending any traffic destined to E to the MAC address of D, which can then forward this traffic on to E. Most widely deployed implementations do not enable proxy ARP by default.
• A could send the traffic to its default gateway, which is a locally connected router that should know the path to any destination on the network.
IPv4 ARP is an example of a protocol that maps interlayer identifiers by including both identifiers in a single protocol.
IPv6 replaces the simpler ARP protocol with a series of Internet Control Message Protocol (ICMP) v6 messages. Five kinds of ICMPv6 messages are defined:
• Type 133, Router Solicitation
• Type 134, Router Advertisement
• Type 135, Neighbor Solicitation
• Type 136, Neighbor Advertisement
• Type 137, Redirect
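The five type codes above can be captured as a small lookup table, which is handy when reasoning about packet captures. The type numbers and names are those defined for ICMPv6 Neighbor Discovery; the helper function is illustrative.

```python
# ICMPv6 Neighbor Discovery message types (RFC 4861)
ND_MESSAGE_TYPES = {
    133: "Router Solicitation",
    134: "Router Advertisement",
    135: "Neighbor Solicitation",
    136: "Neighbor Advertisement",
    137: "Redirect",
}

def is_nd_message(icmpv6_type: int) -> bool:
    """True if the given ICMPv6 type code is a Neighbor Discovery message."""
    return icmpv6_type in ND_MESSAGE_TYPES

print(ND_MESSAGE_TYPES[135])  # Neighbor Solicitation
```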
Figure 6-6 is used to explain the operation of IPv6 ND.
To understand the operation of IPv6 ND, it is best to follow a single host as it is connected to a new network. Host A in Figure 6-6 is used as an example.
• A will begin by forming a link local address, as described previously; assume A chooses fe80::AAAA as its link local address.
• A now uses this link local address as a source address and sends a router solicitation to a link local multicast address (the all nodes multicast address); this is an ICMPv6 message type 133.
• B and D receive this router solicitation and respond with a router advertisement, which is an ICMPv6 message type 134. This unicast packet is transmitted to the link local address A used as the source address, fe80::AAAA.
• The router advertisement contains information on how the newly connected host should determine its local configuration information in the form of several flags.
• The M flag indicates the host should request an address through DHCPv6, because this is a managed link.
• The O flag indicates the host can retrieve information other than the address it should use via DHCPv6. For instance, the DNS server the host should use to resolve DNS names should be retrieved using DHCPv6.
• If the O flag is set, and not the M flag, A must determine its own interface IPv6 address. To do this, it determines the set of IPv6 prefixes in use on this segment by examining the prefix information field in the router advertisement. It chooses one of these prefixes and forms an IPv6 address using the same process it used to form a link local address: it adds a local MAC (EUI-48 or EUI-64) address to the indicated prefix. This process is called Stateless Address Autoconfiguration (SLAAC).
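The "add the MAC address to the prefix" step uses the modified EUI-64 construction: insert ff:fe into the middle of the 48-bit MAC and flip the universal/local bit. A minimal sketch, with an illustrative MAC address:

```python
def eui64_interface_id(mac: str) -> str:
    """Form a modified EUI-64 interface ID from a 48-bit MAC address:
    flip the universal/local bit in the first octet, then insert ff:fe
    between the third and fourth octets."""
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02  # flip the universal/local bit
    eui = octets[:3] + [0xFF, 0xFE] + octets[3:]
    # Pack the eight octets into four 16-bit IPv6 groups.
    groups = [f"{(eui[i] << 8) | eui[i + 1]:x}" for i in range(0, 8, 2)]
    return ":".join(groups)

# Appended to a prefix such as 2001:db8::/64, this yields the SLAAC address.
print(eui64_interface_id("00:11:22:33:44:55"))  # 211:22ff:fe33:4455
```

Note that many modern stacks instead generate randomized interface IDs for privacy, but the EUI-64 construction is the one the text describes.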
• The host must now make certain it has not chosen an address some other host on the same network is using; it must perform DAD. To perform a duplicate address detection:
• The host sends a series of neighbor solicitation messages using the just-formed IPv6 address and asking for the corresponding MAC (physical) address. These are ICMPv6 type 135 messages transmitted from the link local address already assigned to the interface.
• If the host receives a neighbor advertisement or neighbor solicitation using the same IPv6 address, it assumes the locally formed address is a duplicate; in this case, it will form a new address using a different local MAC address and try again.
• If the host does not receive a response, nor another host’s neighbor solicitation using the same address, it assumes the address is unique and assigns the newly formed address to the interface.
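The DAD steps above reduce to a simple probe-and-listen loop. This is a sketch under stated assumptions: `send_ns` and `conflict_heard` are hypothetical callbacks standing in for the real ICMPv6 machinery, and the probe count of 3 is an arbitrary illustrative value.

```python
def perform_dad(tentative_addr, send_ns, conflict_heard, probes=3):
    """Duplicate Address Detection sketch: probe the tentative address with
    neighbor solicitations; hearing a neighbor advertisement (type 136) or
    another host's solicitation (type 135) for the same address means the
    address is a duplicate. send_ns and conflict_heard are hypothetical
    callbacks, not a real API."""
    for _ in range(probes):
        send_ns(tentative_addr)            # ICMPv6 type 135, from the link local address
        if conflict_heard(tentative_addr):
            return False                   # duplicate: form a new address and retry
    return True                            # unique: assign the address to the interface

# A segment where 2001:db8::1 is already in use:
taken = {"2001:db8::1"}
print(perform_dad("2001:db8::1", lambda a: None, lambda a: a in taken))  # False
print(perform_dad("2001:db8::2", lambda a: None, lambda a: a in taken))  # True
```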
Once it has an address to transmit data from, A now needs one more piece of information before sending information to another host on the same segment—the MAC address of the receiving host. If A, for instance, wants to send a packet to C, it will begin by sending a multicast neighbor solicitation message to C asking for its MAC address; this is an ICMPv6 message type 135. When C receives this message, it will respond with the correct MAC address to send traffic for the requested IPv6 address; this is an ICMPv6 message type 136.
While the preceding process describes router advertisements being sent in response to a router solicitation, each router will send periodic router advertisements on each attached interface. The router advertisement contains a lifetime field, indicating how long the router advertisement is valid.
How can a host know whether to try to send a packet to a host over the segment it is connected to, or to send the packet to a router for further processing? If a host should send packets to a router for further processing, how can it know which router (if there is more than one) to send the traffic to? These two problems, together, make up the default gateway problem.
For IPv4, the problem is fairly easy to solve using the prefix and prefix length. Figure 6-7 illustrates.
IPv4 implementations assume any host within the same IPv4 subnet must be physically connected to the same wire. How can the implementation tell the difference? The subnet mask is another form of the prefix length, which indicates where the network address ends and the host address begins. In this case, assume the prefix length is 24 bits, or the network address is a /24. The 24 tells you how many bits are set in the subnet mask:
24 bits = 11111111.11111111.11111111.00000000
Since IPv4 uses a “dotted decimal” notation, this can also be written as 255.255.255.0. To discover whether or not C is on the same wire as A, A will
1. Logically AND the subnet mask with the local interface address
2. Logically AND the subnet mask with the destination address
3. Compare the two results; if they match, the destination host is on the same wire as the local interface
Figure 6-8 illustrates.
There are four IPv4 addresses in Figure 6-8; assume A needs to send packets to C, D, and E. If A knows the prefix length of the local segment is 24 bits either through manual configuration or through DHCPv4, then it can simply look at the 24 most significant bits of each address, compare it to the 24 most significant bits of its own address, and determine whether the destination is on segment or not. Twenty-four bits of an IPv4 address produces a nice break between the third and fourth section of the address (each section of an IPv4 address represents 8 bits of address space, for a total of 32 bits of address space).
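The AND-and-compare procedure can be expressed directly in code. The destination addresses below come from the chapter's examples; A's own address (203.0.113.1) is an assumption for illustration.

```python
import ipaddress

def on_same_segment(local_ip: str, dest_ip: str, prefix_len: int) -> bool:
    """Logically AND the subnet mask with both the local interface address
    and the destination address, then compare the results."""
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    local = int(ipaddress.IPv4Address(local_ip))
    dest = int(ipaddress.IPv4Address(dest_ip))
    return (local & mask) == (dest & mask)

# A (assumed to be 203.0.113.1) deciding about two destinations:
print(on_same_segment("203.0.113.1", "203.0.113.12", 24))    # True: same /24
print(on_same_segment("203.0.113.1", "198.51.100.101", 24))  # False: off segment
```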
Any address with the same left three sections as A's address, called the network address, is on the same segment; any address that does not match is off segment. In this case, the network address for A and C match, so A will believe C is on the same segment, and hence will send packets to C directly, rather than sending them to a router. For any destination A believes is off segment, it will send packets to the final destination’s IPv4 address, but to the default gateway’s MAC address. This means the router acting as the default gateway will accept the packet and switch it based on the destination IPv4 address (packet switching is considered more fully in Chapter 7, “Packet Switching”). How is the default gateway chosen? It is either manually configured or included in a DHCPv4 option.
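The host's decision, then, is which IP address to resolve to a MAC address: the destination itself, or the default gateway. A sketch, where all of the addresses (including the gateway at 203.0.113.254) are illustrative assumptions:

```python
import ipaddress

def next_hop_ip(dest_ip: str, local_ip: str, prefix_len: int,
                default_gateway: str) -> str:
    """Return the address the host should resolve (via ARP) to a MAC address.
    On-segment destinations are framed to their own MAC; off-segment packets
    keep the destination IP in the IP header but are framed to the default
    gateway's MAC."""
    local_net = ipaddress.ip_network(f"{local_ip}/{prefix_len}", strict=False)
    if ipaddress.ip_address(dest_ip) in local_net:
        return dest_ip          # ARP for the destination itself
    return default_gateway      # ARP for the gateway instead

print(next_hop_ip("203.0.113.12", "203.0.113.1", 24, "203.0.113.254"))    # on segment
print(next_hop_ip("198.51.100.101", "203.0.113.1", 24, "203.0.113.254"))  # via gateway
```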
What about D? Because the network portions of the addresses don’t match, A will believe D is off segment. In this case, A will send any traffic for D to its default gateway, which is B. When B receives these packets, it will realize A and D are reachable through the same interface (based on its routing table—building routing tables is considered in Part II, “The Control Plane”), so it will send an ICMP redirect to A telling it to send traffic toward D directly, rather than through B.
IPv6 presents a more complex set of problems to solve when considering which default gateway to use, because IPv6 assumes a single device may have many IPv6 addresses assigned to a particular interface. Figure 6-9 illustrates.
In Figure 6-9, assume the network administrator has configured the following policies:
• No host may connect to A unless it has an address in the 2001:db8:3e8:110::/64 range of addresses.
• No host may connect to D unless it has an address in the 2001:db8:3e8:112::/64 range of addresses.
Note
You would never build policies like this in the real world; this is a contrived situation to illustrate a problem set in a minimally sized network. A much more real problem of this same type would involve unicast Reverse Path Forwarding (uRPF).
To make these policies work, the administrator has assigned 110::3 and 112::12 to host C and 111::120 to host F. This might look odd, but it is perfectly legal for a single segment to have multiple IPv6 subnets assigned in IPv6; it is also perfectly legal to have a single device with multiple addresses. In fact, in IPv6, there are many situations where a single device may have a range of addresses assigned.
From the perspective of the prefix lengths, however, no two addresses assigned to C or F are on the same subnet. Because of this, IPv6 does not rely on the prefix length to determine what is on segment and what is not. Instead, IPv6 implementations keep a table of all connected hosts, using neighbor solicitations to discover what is on segment and what is not. When a host wants to send traffic off the local segment, it sends the traffic to one of the routers it has learned about through router advertisements. If a router receives a packet that it knows another router on the segment has a better route to (because the routers have routing tables that tell them which path to take to any particular destination), the router will send an ICMPv6 redirect message telling the host to use some other first hop router to reach the destination.
This chapter has provided an overview of a very difficult problem, and a number of complex solutions—the Domain Name System, the Dynamic Host Configuration Protocol, the Address Resolution Protocol, and Neighbor Discovery—are far more complex than the high-level overviews provided here. The deployment and operation of DNS servers and the maintenance of the DNS system are an entire career field within network engineering, for instance.
Even so, all of these complex solutions represent just one of four ways to solve the difficult problems of mapping the identifiers used at one layer into the identifiers used at another layer, or the discovery of identifiers in order to facilitate communication. The contrast between the four basic solutions and the diverse protocols implementing those solutions is a solid example of the premise of this book: if you understand the problem space, and you understand the available solutions, then it becomes possible to ask the right questions of a solution to understand how it works.
Once the identifiers have been discovered, and the data to be transported has been marshaled, it is time to switch packets through the network; this is the topic of the next chapter.
Asati, Rajiv, Hemant Singh, Wes Beebee, Carlos Pignataro, Eli Dart, and Wesley George. Enhanced Duplicate Address Detection. Request for Comments 7527. RFC Editor, 2015. doi:10.17487/RFC7527.
Baker, Fred, and Brian E. Carpenter. First-Hop Router Selection by Hosts in a Multi-Prefix Network. Request for Comments 8028. RFC Editor, 2016. doi:10.17487/RFC8028.
Beebee, Wes, Hemant Singh, and Erik Nordmark. IPv6 Subnet Model: The Relationship between Links and Subnet Prefixes. Request for Comments 5942. RFC Editor, 2010. doi:10.17487/RFC5942.
Droms, Ralph. DNS Configuration Options for Dynamic Host Configuration Protocol for IPv6 (DHCPv6). Request for Comments 3646. RFC Editor, 2003. doi:10.17487/RFC3646.
———. Stateless Dynamic Host Configuration Protocol (DHCP) Service for IPv6. Request for Comments 3736. RFC Editor, 2004. doi:10.17487/RFC3736.
Gont, Fernando. Security Implications of IPv6 Fragmentation with IPv6 Neighbor Discovery. Request for Comments 6980. RFC Editor, 2013. doi:10.17487/RFC6980.
Johnson, Jarrod, and Dr. Thomas Narten. Definition of the UUID-Based DHCPv6 Unique Identifier (DUID-UUID). Request for Comments 6355. RFC Editor, 2011. doi:10.17487/RFC6355.
Kempf, James, Jari Arkko, Brian Zill, and Pekka Nikander. SEcure Neighbor Discovery (SEND). Request for Comments 3971. RFC Editor, 2005. doi:10.17487/RFC3971.
Liu, Bing, Sheng Jiang, Xiangyang Gong, Wendong Wang, and Enno Rey. “DHCPv6/SLAAC Interaction Problems on Address and DNS Configuration.” Internet-Draft. Internet Engineering Task Force, August 2016. https://datatracker.ietf.org/doc/html/draft-ietf-v6ops-dhcpv6-slaac-problem-07.
Mrugalski, Tomek, Marcin Siodelski, Bernie Volz, Andrew Yourtchenko, Michael Richardson, Sheng Jiang, Ted Lemon, and Timothy Winters. “Dynamic Host Configuration Protocol for IPv6 (DHCPv6) bis.” Internet-Draft. Internet Engineering Task Force, June 2017. https://datatracker.ietf.org/doc/html/draft-ietf-dhc-rfc3315bis-09.
Narten, Dr. Thomas, Tatsuya Jinmei, and Dr. Susan Thomson. IPv6 Stateless Address Autoconfiguration. Request for Comments 4862. RFC Editor, 2007. doi:10.17487/RFC4862.
Nordmark, Erik, and Igor Gashinsky. Neighbor Unreachability Detection Is Too Impatient. Request for Comments 7048. RFC Editor, 2014. doi:10.17487/RFC7048.
Simpson, William A., Dr. Thomas Narten, Erik Nordmark, and Hesham Soliman. Neighbor Discovery for IP Version 6 (IPv6). Request for Comments 4861. RFC Editor, 2007. doi:10.17487/RFC4861.
Troan, Ole, and Ralph Droms. IPv6 Prefix Options for Dynamic Host Configuration Protocol (DHCP) Version 6. Request for Comments 3633. RFC Editor, 2003. doi:10.17487/RFC3633.
Zeng, Shengyou, John Jason Brzozowski, Kim Kinnear, and Bernie Volz. DHCPv6 Leasequery. Request for Comments 5007. RFC Editor, 2007. doi:10.17487/RFC5007.
1. Consider each of the four ways to solve the interlayer discovery and mapping problem discussed in the chapter. Build a chart describing the state and surface interactions for each one, and what the optimization tradeoffs might be.
2. Describe the process the IETF uses for maintaining number registries. Does this seem like a complex system or a simple one? Does it seem as though it would be effective in ensuring identifier uniqueness?
3. Consider that there must be millions, or perhaps hundreds of millions, of DNS queries each day. How many DNS root servers are there? Given these two numbers, how do you think the DNS system is scaled to support the entire global Internet?
4. Is it possible to convert a globally reachable IP address to a DNS name (to map in the opposite direction from what is described in the chapter)? Can you think of one example where this would be useful?
5. The “larger” DNS system also contains a mapping system from DNS names to human-readable information about domain ownership called whois. What protocol does it use to communicate, where is the information stored, and what kinds of information are available through this system?
6. Explain what DNS glue records are and what they are used for.
7. From the perspective of state, optimization, and surface, what are the tradeoffs between a mechanism like DHCP and one like SLAAC? Consider not only the ease with which addresses can be assigned, but any security and control issues that might arise with each one.
8. Why would most implementations not enable proxy ARP by default? What is the risk in enabling proxy ARP?
9. How does the neighbor discovery protocol End System to Intermediate System (ES-IS) compare to IPv6 ND?
10. Consider how IPv6 Router Discovery works in relation to the default gateway problem.
1. This chart is taken from https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml.
2. Johnson and Narten, Definition of the UUID-Based DHCPv6 Unique Identifier (DUID-UUID).
3. Liu et al., “DHCPv6/SLAAC Interaction Problems on Address and DNS Configuration.”
Network devices are inserted into networks to solve a number of problems, including connecting different kinds of media and scaling a network by only carrying packets where they need to go. Routers and switches are, however, complex devices in their own right; engineers can build an entire career by specializing in solving just a small set of the problems encountered in carrying packets through a network device.
Figure 7-1 is used to discuss an overview of the problem space.
In Figure 7-1, there are four distinct steps:
1. The packet needs to be copied off the physical media and into memory within the device; this is sometimes called clocking the packet off the wire.
2. The packet needs to be processed, which generally means determining the correct outbound interface and modifying the packet in any way necessary. For instance, in a router, the lower layer header is stripped off and replaced with a new one; in a stateful packet filter, the packet may be dropped based on internal state; etc.
3. The packet needs to be copied from the inbound to the outbound interface. This often involves a trip across an internal fabric, or bus. Some systems skip this step by using a single memory pool for both inbound and outbound interfaces; these are called shared memory systems (one thing about network engineering you will notice is the names of things either tend to be too clever or too obvious).
4. The packet needs to be copied back onto the outbound physical media; this is sometimes called clocking the packet onto the wire.
Note
Smaller systems, particularly those focused on fast, consistent packet switching, will often use shared memory to transfer a packet from one interface to another. The time required to copy a packet in memory is often larger than the speed at which the interfaces operate; shared memory systems avoid this in memory copying of packets.
The problem space discussed in the sections that follow, then, consists of this:
How are packets which need to be forwarded by the network device carried from the inbound to the outbound physical media, and how are packets exposed to processing along this path?
Each of the following sections discusses one part of the solution to this problem.
The first step in processing a packet through a network device is to copy the packet off the wire and into memory. Figure 7-2 is used to illustrate the process.
There are two steps in Figure 7-2:
Step 1. The physical media chipset (the PHY chip) will copy each time (or logical) slot from the physical media, which represents a single bit of data, into a memory location. This memory location is actually mapped into a receive ring, which is a set of memory locations (packet buffer) set aside for the sole purpose of receiving packets being clocked off the wire. The receive ring, and all packet buffer memory, is normally carved out of a single kind of memory accessible by (shared by) all the switching components on the receiving end of the line card or device.
A ring buffer is used based on a single pointer, which is incremented each time a new packet is inserted into the buffer. For instance, in the ring shown in Figure 7-2, the pointer would begin at slot 1 and increment through the slots as packets are copied into the ring buffer. If the pointer reaches slot 7, and a new packet arrives, the packet will be copied into slot 1—regardless of whether or not the contents of slot 1 have been processed.
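The overwrite-on-wrap behavior of the receive ring can be shown with a tiny model. This is a toy sketch, not driver code; the four-slot ring and packet names are illustrative.

```python
class ReceiveRing:
    """Toy receive ring: a fixed set of slots with a single write pointer
    that wraps around, overwriting old slots whether or not their contents
    have been processed."""

    def __init__(self, slots=8):
        self.buffers = [None] * slots
        self.write = 0

    def clock_in(self, packet):
        # Copy the packet into the current slot, regardless of its state,
        # then advance (and possibly wrap) the write pointer.
        self.buffers[self.write] = packet
        self.write = (self.write + 1) % len(self.buffers)

ring = ReceiveRing(slots=4)
for n in range(6):           # six packets arrive into a four-slot ring
    ring.clock_in(f"pkt{n}")
print(ring.buffers)          # ['pkt4', 'pkt5', 'pkt2', 'pkt3']
```

The last two arrivals silently overwrote slots 0 and 1; this is exactly how unprocessed packets are lost when a receiver cannot keep up with the wire.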
In packet switching, the most time-consuming and difficult task is copying packets from one location to another; this is avoided as much as possible through the use of pointers. Rather than moving a packet in memory, a pointer to the memory location is passed from process to process within the switching path.
Step 2. Once the packet is clocked into memory, some local processor is interrupted. During this interrupt, the local processor will remove the pointer to the packet buffer containing a packet from the receive ring and place a pointer to an empty packet buffer onto the receive ring. The pointer is placed on a separate list called the input queue.
Once the packet is in the input queue, it can be processed. Processing can be seen as a chain of events, rather than a single event; Figure 7-3 illustrates.
Some processing needs to take place before the packet is switched, such as Network Address Translation, because it changes some information about the packet used in the actual switching process. Other processing can take place after the switch.
1. The switching process looks up the destination Media Access Control (MAC), or physical device, address in a forwarding table (in switches this is sometimes called the bridge learning table, or just the bridge table).
2. The outbound interface is determined based on the information in this table.
3. The packet is moved from the input queue to the output queue.
The packet is not modified in any way during the switching process; it is copied from the input queue to the output queue.
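The three switching steps can be sketched as a single table lookup followed by a queue move. The dict-based bridge table and queue structures here are illustrative stand-ins for the real data structures.

```python
from collections import defaultdict

def switch_frame(frame, bridge_table, output_queues):
    """Sketch of the switching path: look up the destination MAC in the
    bridge table (steps 1-2), then move the frame, unmodified, to the
    outbound interface's output queue (step 3)."""
    out_if = bridge_table[frame["dst_mac"]]
    output_queues[out_if].append(frame)  # the frame itself is not rewritten
    return out_if

bridge_table = {"00:11:22:33:44:55": "eth1"}
queues = defaultdict(list)
frame = {"dst_mac": "00:11:22:33:44:55", "payload": b"hello"}
print(switch_frame(frame, bridge_table, queues))  # eth1
```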
Note
How is the forwarding table built? By a control plane. Part II of this book considers control planes in some detail.
Routing is a more complex process than switching; Figure 7-4 illustrates.
In Figure 7-4, the packet begins on the input queue. The switching processor then
1. Removes (or ignores) the lower layer header (for instance, the Ethernet framing on the packet). This information is used to determine whether or not the router should receive the packet, but is not used during the actual switching process.
2. Looks up the destination address (and potentially other information) in the forwarding table. The forwarding table relates the destination of the packet to the next hop of the packet. The next hop can either be the next router in the path toward the destination or the destination itself.
3. The switching processor then examines an interlayer discovery table (such as those considered in Chapter 6, “Interlayer Discovery”), to determine the correct physical address to which to send the packet to bring the packet one hop closer to the destination.
4. A new lower layer header is built using this new lower layer destination address and copied onto the packet. Normally, the lower layer destination address is cached locally, along with the entire lower layer header. The entire header is rewritten in a process called the MAC header rewrite.
The entire packet is now moved from the input queue to the output queue.
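The four routing steps can be sketched with plain dicts standing in for the forwarding table and the interlayer discovery table. All of the names and addresses here are illustrative assumptions, not a real router API.

```python
def route_packet(packet, forwarding_table, arp_table, local_mac):
    """Sketch of steps 1-4: the old lower layer header is discarded (step 1),
    the forwarding table maps the destination to a next hop (step 2), the
    interlayer discovery table maps the next hop to a MAC address (step 3),
    and a new lower layer header is written onto the packet (step 4)."""
    dest = packet["dst_ip"]
    next_hop = forwarding_table[dest]               # may be the destination itself
    packet["l2"] = {                                # MAC header rewrite
        "src_mac": local_mac,
        "dst_mac": arp_table[next_hop],
    }
    return packet

fib = {"198.51.100.101": "203.0.113.254"}           # destination -> next hop
arp = {"203.0.113.254": "aa:bb:cc:dd:ee:ff"}        # next hop -> MAC
pkt = route_packet({"dst_ip": "198.51.100.101"}, fib, arp, "00:00:00:00:00:01")
print(pkt["l2"]["dst_mac"])  # aa:bb:cc:dd:ee:ff
```

As the text notes, real implementations cache the entire rewritten header alongside the next hop, so step 4 is a single memory copy rather than a rebuild.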
Because routing is a more complex process than switching, why route? Figure 7-5 will be used to illustrate.
There are at least three specific reasons to route, rather than switch, in a network. Using the network in Figure 7-5 as an example:
• 如果 [B,C] 链路是与连接到主机的两条链路不同类型的物理介质,具有不同的编码、标头、寻址等,则路由将允许 A 和 D 进行通信,而无需担心这些差异。链接类型。在纯交换网络中可以通过报头转换来克服这个问题,但是报头转换实际上并不比交换路径中的路由少做任何工作,因此不通过路由来解决这个问题没有什么意义。另一种解决方案可能是让每种物理媒体类型就单一寻址和数据包格式达成一致,但考虑到物理媒体的不断进步以及许多不同类型的物理媒体,这似乎是一个不太可能的解决方案。
• If the [B,C] link is a different kind of physical media than the two links connecting to hosts, with different encoding, headers, addressing, etc., then routing will allow A and D to communicate without worrying about these differences in the link types. This could be overcome in a purely switched network through header translation, but header translation doesn’t really take any less work than routing in the switching path, so there is little point in not routing to solve this problem. Another solution might be for every physical media type to agree on a single addressing and packet format, but given the constant advances in physical media, and the many different kinds of physical media, this seems like an unlikely solution.
• 如果整个网络发生切换,B 需要知道D 和E 的完整可达性信息;具体来说,D 和 E 需要知道连接到 C 之外的主机网段的每个设备的物理或较低层地址。在较小的网络中,这可能不是一个大问题,但在具有数十万个节点的较大网络中,或者全球互联网,这将无法扩展——有太多的状态需要管理。可以使用较低层寻址来聚合可达性信息,但这比使用基于设备的拓扑连接点分配的较高层地址更困难,而不是使用在工厂分配的唯一标识接口芯片组的地址。
• If the entire network were switched, B would need to know full reachability information for D and E; specifically, D and E would need to know the physical or lower layer addresses for each device connected to the host segment beyond C. This might not be a big problem in a smaller network, but in larger networks, with hundreds of thousands of nodes, or the global Internet, this will not scale—there is simply too much state to manage. It is possible to aggregate reachability information with lower layer addressing, but it is more difficult than using a higher layer address assigned based on the device’s topological attachment point, rather than an address assigned at the factory that uniquely identifies the interface chipset.
• 如果D 向“网段上的所有设备”发送广播,如果B 和C 是交换机,则A 将收到该广播,但如果B 和C 是路由器,则不会收到该广播。广播数据包无法消除,因为它们是几乎每个传输协议的重要组成部分,但在纯交换网络中,广播带来了非常难以解决的扩展问题。广播在路由器处被阻止(或者更确切地说被消耗)。
• If D sends a broadcast to “all devices on segment,” A will receive the broadcast if B and C are switches, but not if B and C are routers. Broadcast packets cannot be eliminated, as they are an essential part of just about every transport protocol, but in purely switched networks, broadcasts present a very hard-to-solve scaling problem. Broadcasts are blocked (or rather consumed) at a router.
Note
In the commercial networking world, the terms routing and switching are often used interchangeably. The reason for this is primarily marketing history; routing always originally meant “switched in software,” while switching always meant “switched in hardware.” As packet switching engines capable of rewriting a MAC header in hardware became available, they were called “Layer 3 switches,” which was eventually shortened to just switch. Most data center “switches,” for instance, are actually routers, as they do perform a MAC header rewrite on forwarded packets. If someone calls a piece of equipment a switch, then it is best to clarify whether it is a Layer 3 switch (properly a router) or a Layer 2 switch (properly a switch).
Note
The terms link and connection are used interchangeably here; a link is a physical or virtual wired or wireless connection between two devices.
In some network designs, engineers will introduce parallel links between two network nodes. If you assume these parallel links are equal in bandwidth, latency, and so on, they are said to be equal cost. In this scenario, the links are said to be equal cost multipath (ECMP).
In networking, there are two variants seen frequently on production networks. They behave similarly but are different in how the links are grouped and managed by the network operating system.
Link aggregation schemes take multiple physical links and bundle them into a single virtual link. For purposes of routing protocols and loop prevention algorithms such as spanning tree, a virtual link is treated as if it were a single physical link.
Link aggregation is used to increase bandwidth between network nodes without having to replace slower physical links with faster ones. For instance, two 10Gbps links could be aggregated into a single 20Gbps link, thus doubling the potential bandwidth between the two nodes, as shown in Figure 7-6. The word potential was chosen carefully, as aggregated links do not, in practice, scale linearly.
The problem link aggregation faces is determining which packets should be sent down which member of the bundle. Intuitively, this might not seem like a problem. After all, it would seem to make sense to use the link bundle in a round-robin fashion. The initial frame would be sent down the first member of the bundle, the second frame down the second member, and so on, eventually wrapping back around to the first link bundle member. In this way, the link should be used perfectly evenly, and bandwidth should scale linearly.
There are a very few real-life implementations where aggregated links are used on a round-robin basis like this because they run the risk of delivering out-of-order packets. Assume Ethernet frame one is sent down link member one, and frame two is sent down link member two immediately after. For whatever reason, frame two gets to the other end before frame one. The packets that these frames contain will be delivered to the receiving hosts out of order—packet two before packet one. This is a problem because a computational burden is now placed on the host to reorder the packets so the entire datagram can be properly reassembled.
Therefore, most vendors implement flow hashing to ensure the entirety of a traffic flow uses the same bundle member. In this way, there is no risk of a host receiving packets out of order, as they will be sent sequentially across the same link member.
Flow hashing works by performing a mathematical operation on two or more static components of a flow, such as source and destination MAC addresses, source and destination Internet Protocol (IP) addresses, or Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) port numbers to compute a link member the flow will use. Because the characteristics of the flow are static, the hashing algorithm results in an identical computation for each frame or packet in a traffic flow, guaranteeing the same link will be used for the life of the flow.
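The idea can be sketched as follows. MD5 is only an illustrative stand-in for the simple hardware hash (often a CRC or an XOR fold) a real switch would use, and the field names are assumptions:

```python
import hashlib

def flow_hash(src_ip, dst_ip, src_port, dst_port, n_links):
    """Map a flow's static header fields to one member of an n-link bundle.

    Because the inputs never change for the life of the flow, every frame
    in the flow hashes to the same bundle member, preserving packet order.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links
```

Note the trade-off the text goes on to describe: the function knows nothing about flow size, so an elephant flow and a mouse flow count equally in the distribution.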
While flow hashing solves the out-of-order packet problem, it introduces a new problem. Not all flows are the same size. Some flows use a high amount of bandwidth, such as those used for file transfers, backups, or storage; these are sometimes called elephant flows. Other flows are quite small, such as those used to load a web page or communicate using voice over IP; these are sometimes called mouse flows. Because flows are different sizes, some link members might be running at capacity, while others are underutilized.
This mismatch in utilization brings us back around to the point about linear scaling. If frames were load-balanced across an aggregated link bundle perfectly evenly, then adding new links to the bundle would evenly multiply capacity. However, hashing algorithms combined with the unpredictable volume of traffic flows mean bundled links will not be evenly loaded.
The job of the network engineer is to understand the type of traffic flowing through the aggregated bundle and choose an available hashing algorithm that will result in the most even load distribution. For instance, some considerations might be
• Are many hosts in the same broadcast domain communicating with one another across the aggregated link? Hashing against the MAC addresses found in the Ethernet frame header is a possible solution, because the MAC addresses will be varied.
• Are a small number of hosts communicating to a single server across the aggregated link? There might not be enough variety of either MAC addresses or IP addresses in this scenario. Instead, hashing against TCP or UDP port numbers might result in the greatest variety and subsequent traffic distribution across the aggregated links.
When bundling links together, you must consider the network devices on either end of the link and take special care to allow the link bundle to be formed while maintaining a loop-free topology. The most common way of addressing this issue is by using industry standard Link Aggregation Control Protocol (LACP), codified as Institute of Electrical and Electronics Engineers (IEEE) standard 802.3ad.
On links designated by a network engineer, LACP advertises its intent to form an aggregated link to the other side. The other side, also running LACP, accepts this advertisement if the announced parameters are valid, and forms the link. Once the link bundle has been formed, the aggregated link is placed into a forwarding state. Network operators can then query LACP for status on the aggregated link and the state of link members.
LACP is also aware when a member of the link bundle goes down, as control packets no longer flow across the failed link. This capability is useful, as it allows the LACP process to notify the network operating system to recalculate its flow hashes. Without LACP, it might take the network operating system a longer time to become aware of the failed link, causing traffic to be hashed to a link member that is no longer a valid path.
Other link aggregation control protocols exist. It is also possible in some scenarios to create link bundles manually without the protection of a control protocol; however, LACP dominates as the standard in use by networking vendors as well as host operating systems and hypervisor vendors for link aggregation.
Multichassis Link Aggregation (MLAG) is a feature offered by some network vendors allowing a single aggregated link bundle to span two or more network switches. To facilitate this, a vendor’s special control protocol will run between the MLAG member switches, making multiple network switches act as if they are one switch as far as LACP, Spanning Tree Protocol (STP), and any other protocols are concerned.
The usual justification for MLAG is physical redundancy, where a network engineer requires a lower layer (such as Ethernet) adjacency between network devices (instead of a routed connection), and also requires the link bundle to remain up if the remote side of the link fails. Spreading the link bundle between two or more switches allows this requirement to be met. Figure 7-7 illustrates.
While many networks operate some flavor of MLAG in production, many others have shied away from the technology, at least partially because MLAG is proprietary; there is no such thing as multivendor MLAG. Better network design trends away from widely dispersed switched domains, a scenario that benefits from MLAG. Instead, network design is trending toward constrained switched domains interconnected through routing, obviating the need for MLAG technologies.
Routed control planes, called routing protocols (see the chapters in Part II of this book for more information on routing and loop-free path calculation), sometimes compute a set of multiple paths through a network with equal costs. In the case of routing, multiple links with the same cost may not even connect a single pair of devices; Figure 7-8 illustrates.
In Figure 7-8, there are three paths:
• [A,B,D] with a total cost of 10
• [A,D] with a total cost of 10
• [A,C,D] with a total cost of 10
Because these three paths have the same cost, they may all three be installed in the local forwarding table at A and D. Router A, for instance, may forward traffic over any one of these three links toward D. When a router has multiple options to reach the same destination, how does it decide which physical path to take?
As with lower layer ECMP, the answer is hashing. Routed ECMP hashing can be performed on a variety of fields. Common fields to hash against include source or destination IP addresses and source or destination port numbers. The hashing results in a consistent path being selected for the duration of an L3 flow. Only in the case of a link failure would the flow need to be rehashed and a new forwarding link chosen.
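Routed ECMP selection can be sketched the same way. The prefix, the next hop labels, and the CRC32 hash below are illustrative assumptions standing in for whatever the forwarding hardware actually implements, not any vendor's algorithm:

```python
import zlib

# A hypothetical ECMP set at router A in Figure 7-8: three equal-cost
# next hops installed in the forwarding table for one destination prefix.
ecmp_next_hops = {
    "10.2.0.0/16": ["via-B", "direct-D", "via-C"],
}

def select_path(prefix, flow_tuple):
    """Pick one installed equal-cost next hop by hashing the flow's fields.

    The same flow tuple always yields the same next hop; only a change in
    the next hop set (e.g., a link failure) alters the result.
    """
    hops = ecmp_next_hops[prefix]
    h = zlib.crc32("|".join(map(str, flow_tuple)).encode())
    return hops[h % len(hops)]
```

Removing a failed hop from the list and rehashing is the moment, noted above, when a flow can move to a new forwarding link.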
The steps involved in routing a single packet may seem very simple—look up the destination in a table, build (or retrieve) a MAC header rewrite, rewrite the MAC header, and then place the packet on the correct queue for an outbound interface. As simple as this might be, it still takes time to process a single packet. Figure 7-9 illustrates three different paths through which a packet may be switched in a network device.
Figure 7-9 illustrates three different switching paths through a device; these are not the only possible switching paths, but they are the most common ones. The first path processes packets through a software application running on a general-purpose processor (GPP), and consists of three steps:
1. The packet is copied off the physical media into main memory, as described in the sections above.
2. The physical signal processor, the PHY chip, sends a signal to the GPP (probably, but not necessarily, the main processor in the network device), called an interrupt.
a. The interrupt causes the processor to stop other tasks (this is why it is called an interrupt) and run a small piece of code that will schedule another process, the switching application, to run later.
b. When the switching application runs, it will do the appropriate lookups and make the appropriate modifications to the packet.
3. Once the packet has been switched, it is copied out of main memory by the outbound processor, as described in the following sections.
Switching a packet through a process in this way is often called process switching (for obvious reasons), or sometimes the slow path. No matter how fast the GPP is, to reach full line rate switching on higher-speed interfaces requires a lot of tuning—to the point of being almost impossible. The second switching path shown in Figure 7-9 was designed to process packets more quickly:
4. The packet is copied off the physical media into main memory, as described in the previous sections.
5. The PHY chip interrupts the GPP; the interrupt handler code, rather than calling another process, actually processes the packet.
6. Once the packet has been switched, the packet is copied from main memory into the output process, as described in the text that follows.
This process is often called interrupt context switching, for obvious reasons; many processors can support switching packets fast enough to carry packets between low and moderate rate interfaces in this mode. The switching code itself must be highly optimized, of course, because switching the packet causes the processor to stop executing any other tasks (such as processing a routing protocol update). This was originally—and is still sometimes—called the fast switching path.
For truly high-speed applications, the process of switching packets must be offloaded from the main processor, or any kind of GPP, and onto a specialized processor designed for the specific task of processing packets. Sometimes these processors are called Network Processing Units (NPUs), much like a processor designed to handle just graphics is called a Graphics Processing Unit (GPU). These specialized processors are a subset of a broader class of processors called Application-Specific Integrated Circuits (ASICs), and are often just called ASICs by engineers. Switching a packet through an ASIC is shown as steps 7 through 9 in Figure 7-9:
7. The packet is copied off the physical media into the ASIC’s memory, as described in the previous sections.
8. The PHY chip interrupts the ASIC; the ASIC handles the interrupt by switching the packet.
9. Once the packet has been switched, the packet is copied from the ASIC’s memory into the output process, as described next.
Many specialized packet processing ASICs have a number of interesting features, including
• Internal memory structures (registers) configured specifically to handle the various kinds of addresses used in networks
• Specialized instruction sets designed to handle various packet processing requirements, such as examining the inner headers being carried in a packet, and rewriting the MAC header
• Specialized memory structures and instruction sets designed to store and look up destination addresses to speed packet processing
• The ability to recycle a packet through the packet pipeline in order to perform operations that cannot be supported in a single pass, such as deep packet inspection or specialized filtering tasks
In smaller network devices with just one network process (the ASIC or NPU, as described previously), moving a packet from the input queue to the output queue is simple. The input and output interfaces both share a common pool of packet memory, so a pointer to the packet can be moved from one queue to the other.
To reach higher port counts and larger-scale devices—particularly chassis devices—there must be an internal bus, or fabric, that connects the input and output packet processing engines. One common type of fabric used to interconnect packet processing engines within a network device is a crossbar fabric; Figure 7-10 illustrates.
The size and structure of the crossbar fabric are dependent on the number of ports connected. If there are more ports in the switch than feasible to connect via a single crossbar fabric, then the switch will use multiple crossbar fabrics. A common topology for this kind of fabric is a multistage Clos connecting the ingress and egress crossbar fabrics together. You might think of this as a crossbar of crossbars.
Spine and leaf fabrics, which are a form of Clos, are considered in Chapter 25, “Disaggregation, Hyperconvergence, and the Changing Network.”
A crossbar fabric requires a sense of time (or rather a fixed time slot) and a scheduler to work. At each interval of time, one output (send) port is connected to one input (receive) port, so that during this time period the sender can transmit a packet, frame, or set of packets to the receiver. The scheduler “connects” the correct cross points on the crossbar fabric for transmissions to take place during the correct time period. For instance:
• Line card 1 (LC1) would like to send a packet to LC3.
• LC3 would like to send a packet to LC5.
During the next time cycle, the scheduler can connect row A to column 3 (“make” the connection at A3) and connect row C to column 5 (“make” the connection at C5) so a communication channel is set up between these pairs of line cards.
What happens if two transmitters want to send a packet to a single receiver? For instance, if during one period of time both LC1 and LC2 want to send a packet to LC9 across the crossbar fabric? This is called contention, and is a situation that must be handled by the fabric scheduler. Which of the two ingress ports should be allowed to send their traffic to the egress port? And where are the ingress traffic queues in the meantime?
One option is for the packets to be stored in an input queue; switches that use this technique are called input-queued switches. These kinds of switches suffer from head-of-line (HOL) blocking. HOL blocking is what happens when the packet at the head of the line, waiting to be forwarded across the fabric, blocks the other packets queued up behind it.
Another option is for the switch to leverage multiple virtual output queues (VOQs) per input port.
VOQs give a crossbar fabric multiple places to stash ingress packets while they are waiting to be delivered to their egress ports. In many switch designs, one VOQ exists per output port for which input traffic is destined. Therefore, an input port can have several packets queued in several different VOQs, assuming several different egress ports.
Each of these VOQs is eligible to be serviced during a single clock cycle. This means HOL blocking is eliminated, because several different packets from the same input queue can be passed through the crossbar fabric at the same time. Rather than a single queue existing for an input port, there are several different queues. Think of it as additional checkout lines being opened at the grocery store.
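A VOQ arrangement can be sketched as one queue per egress port at each input; the class and method names below are assumptions for illustration only:

```python
from collections import defaultdict, deque

class InputPort:
    """An input port holding one virtual output queue per egress port."""

    def __init__(self):
        self.voqs = defaultdict(deque)    # egress port -> queue of packets

    def enqueue(self, packet, egress_port):
        # Classify the packet by its egress port instead of using one FIFO.
        self.voqs[egress_port].append(packet)

    def head_packets(self):
        # Every non-empty VOQ offers its head packet in the same clock
        # cycle, so a packet bound for a busy egress port cannot block
        # packets behind it that are bound for idle ports.
        return {port: q[0] for port, q in self.voqs.items() if q}
```

With a single FIFO, only one head packet would be visible per cycle; here, each egress port with waiting traffic exposes a candidate, which is what eliminates HOL blocking.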
Even with VOQs, the potential remains for contention across the crossbar fabric. The most common example is where two or more ingress packets need to leave the switch via the same egress port at the same time, or more precisely, on the same clock cycle. An egress port can only send one packet per clock cycle.
Determining which ingress queue will get to deliver traffic to the egress port first is an algorithm determined by the switch manufacturer to make the maximum use of the hardware. iSLIP is one scheduling algorithm used by switches to solve this problem.
The iSLIP algorithm arbitrates crossbar fabric contention, scheduling traffic so the network device achieves nonblocking throughput. For the purposes of this discussion, it is helpful to scrutinize iSLIP in its simplest form by reviewing what happens when the iSLIP algorithm executes once.
There are three crucial events that take place during an iSLIP execution:
1. Request. All input points (ingress) on the crossbar fabric with queued traffic ask their output points (egress) if they can send.
2. Grant. Each output point that received a request must determine which input point will be allowed to send. If there is a single request, then a grant is awarded with no further deliberation. However, if there are multiple requests, the output point must determine which input point can send. This is done via round-robin, where one request is awarded a grant, a subsequent request is awarded a grant during the next execution of iSLIP, and so on in a circular fashion. When the decision has been made for this particular execution of iSLIP, each output point sends its grant message, effectively signaling permission to send, to the appropriate input point.
3. Accept. An input point considers the grant messages it has received from output points, choosing a grant in round-robin fashion. Upon selection, the input notifies the output that the grant has been accepted. If and only if the output point is notified the grant was accepted will the output point move on to the next request. If there is no accept message received, then the output point will attempt to service the previous request during the next execution of iSLIP.
Understanding the request, grant, and accept process gives us insight into how packets can be delivered simultaneously through a crossbar fabric without colliding. However, if you ponder a complex set of inputs, VOQs, and outputs, you might realize a single iSLIP run doesn’t schedule as many packets for delivery as it could have after only a single execution.
Certainly, some inputs were granted outputs and some packets can be forwarded, but it is possible some outputs were never matched with a waiting input. In other words, if you limit iSLIP to a single execution per clock cycle, we’d be leaving available egress bandwidth unused.
Therefore, the normal practice is to run iSLIP through multiple iterations. The result is the number of input-to-output matches is maximized. More packets can be sent across the crossbar fabric at a time. How many times does iSLIP need to run to maximize the number of packets that can be switched through the crossbar fabric in a clock cycle? Research suggests that for the traffic patterns prevalent on most networks, running iSLIP four times matches inputs and outputs across the crossbar fabric the best. Executing iSLIP more than four times does not result in a meaningfully larger number of matches. In other words, there is nothing to be gained running iSLIP five, six, or ten times in most network environments.
This discussion has assumed, so far, that the traffic flowing through the crossbar fabric was all of equal importance. However, in modern data centers, certain traffic classes are prioritized over others. For instance, Fibre Channel over Ethernet (FCoE) storage frames need to traverse the fabric in a lossless manner, while a TCP session falling into a scavenger QoS class does not.
Does iSLIP handle traffic with different priorities, granting some requests before others? Yes, but in a modified form of the algorithm we’ve looked at. Variants to iSLIP include Prioritized, Threshold, and Weighted iSLIP.
Beyond iSLIP, used here merely as a convenient example of contention management, vendors will write their own algorithms to suit their own crossbar fabric’s hardware capabilities. For example, this section only considered an input-queued crossbar fabric, but many crossbar fabrics offer output-queuing on the egress side of the crossbar as well.
一旦数据包通过总线传送到出站线卡,或者数据包缓冲区上的指针从输入队列移动到输出队列,网络设备仍然有工作要做。如图 7-11所示。
Once the packet is carried across the bus to the outbound line card, or the pointer on the packet buffer is moved from the input queue to the output queue, there is still work for the network device to do. Figure 7-11 illustrates.
请注意,图 7-11中所示的环是发送环,而不是接收环。图7-11中有四个步骤:
Note the ring shown in Figure 7-11 is the transmit ring, rather than the receive ring. There are four steps in Figure 7-11:
Step 1. The packet is passed to the transmit side of the router for forwarding. There may be post-switch processing that needs to be done here, depending on the platform and specific features; these are not shown in this illustration. An attempt will first be made to place the packet directly on the transmit ring, where it can be transmitted. If the ring already has a packet on it, or if the ring is full (depending on the implementation), the packet will not be placed on the transmit ring. If the packet is placed on the transmit ring, step 2 is skipped (which means the packet will not be processed using any outbound Quality of Service [QoS] rules). Otherwise, the packet is placed on the output queue, where it will await being transferred to the transmit ring.
Step 2. If the packet cannot be placed on the transmit ring, it will be placed on the output queue and held for transmission at some later time.
Step 3. Periodically, the transmit code will move packets from the output queue to the transmit ring. The order in which packets are taken from the output queue will depend on the QoS configuration; see Chapter 8, “Quality of Service,” for more information on how QoS is applied to queues in various situations.
Step 4. At some point after the packet has been moved to the transmit ring, the transmit PHY chip reads each bit from the packet buffer, encodes the bits into the proper format for the outbound physical media type, and copies the packet onto the wire.
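The four steps above can be modeled as a short software sketch. The class and method names here are hypothetical, and a real device manages the transmit ring as hardware DMA descriptors rather than a Python container; this is only meant to make the try-the-ring-first, queue-otherwise logic concrete.

```python
from collections import deque

class TransmitPath:
    """Illustrative model of the transmit-side steps (not vendor code)."""

    def __init__(self, ring_size):
        self.ring = deque()          # transmit ring (DMA descriptors in hardware)
        self.ring_size = ring_size
        self.output_queue = deque()  # software output queue, subject to QoS

    def enqueue(self, packet):
        # Step 1: try to place the packet directly on the transmit ring,
        # skipping outbound QoS processing entirely.
        if not self.output_queue and len(self.ring) < self.ring_size:
            self.ring.append(packet)
        else:
            # Step 2: the ring is full (or packets are already queued),
            # so hold the packet on the output queue for later.
            self.output_queue.append(packet)

    def drain(self):
        # Step 3: periodically move packets from the output queue to the
        # ring; a real implementation would select them in QoS order.
        while self.output_queue and len(self.ring) < self.ring_size:
            self.ring.append(self.output_queue.popleft())

    def transmit(self):
        # Step 4: the PHY serializes the next packet from the ring.
        return self.ring.popleft() if self.ring else None
```

With a ring of size two, a third packet arriving before any transmission lands on the output queue, and only a later drain pass (where QoS would apply) moves it to the ring.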
The details of packet switching might seem mired in minutiae. After all, does it matter exactly how a packet or frame moves between two devices? Is it really all that critical to comprehend serialization and deserialization, equal cost multipath, crossbar fabric contention, transmit rings, and the like?
In a certain sense, these details don’t matter to the average network engineer. When a network device is doing its job moving data through it, the actual processes followed by the switch to get that job done are trivialities. “It just works.”
However, switching internals often factor greatly into network design. For example, consider port-to-port latency. In some high-traffic networks, the amount of time it takes for a switch to move a frame from ingress port to egress port makes a difference in overall application performance. In modern switches, port-to-port latency is measured in single microseconds or hundreds of nanoseconds. If one switch gets the job done in 1 microsecond, while another can do it in 400 nanoseconds, that can impact a hardware choice.
Another consideration is troubleshooting. What happens when a network device does not appear to be forwarding all of the packets it receives, i.e., there is more ingress than egress? Small amounts of packet loss in a network fabric are troublesome to track down. Understanding a network device’s internal packet switching process shines a great deal of light on where the breakdown might be happening.
Therefore, don’t dismiss packet switching as “too close to the wires” to be relevant to the aspiring networker. Rather, embrace a knowledge of packet switching for the deep insights into overall network performance that it supplies.
“1.5. Basics of How Operating Systems Work.” Operating Systems Study Guide. Accessed April 22, 2017. http://faculty.salina.k-state.edu/tim/ossg/Introduction/OSworking.html.
Bollapragada, Vijay, Russ White, and Curtis Murphy. Inside Cisco IOS Software Architecture. Indianapolis, IN: Cisco Press, 2000.
Scudder, F. J., and J. N. Reynolds. “Crossbar Dial Telephone Switching System.” Bell System Technical Journal 18, no. 1 (January 1939). http://archive.org/details/bstj18-1-76.
“Cisco Nexus 5548P Switch Architecture.” Cisco. Accessed July 29, 2017. http://www.cisco.com/c/en/us/products/collateral/switches/nexus-5548p-switch/white_paper_c11-622479.html.
“Fast Ethernet | Integrating 100mbps into Existing 10mbps Networks.” Savvius. Accessed April 22, 2017. https://www.savvius.com/resources/compendium/fast_ethernet/overview.
Heineman, George T., Gary Pollice, and Stanley Selkow. Algorithms in a Nutshell: A Practical Guide. 2nd edition. O’Reilly Media, 2016.
Inniss, Daryl, and Roy Rubenstein. Silicon Photonics: Fueling the Next Information Revolution. 1st edition. Morgan Kaufmann, 2016.
“Intel Ethernet Switch Family Hash Efficiency.” Intel, April 2009. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ethernet-switch-hash-efficiency-paper.pdf.
“Interrupt.” Wikipedia, Accessed February 3, 2017. https://en.wikipedia.org/w/index.php?title=Interrupt&oldid=763436239.
Kloth, Axel K. Advanced Router Architectures. Boca Raton, FL: CRC Press, 2005.
Konheim, Alan G. Hashing in Computer Science: Fifty Years of Slicing and Dicing. 1st edition. Wiley-Interscience, 2011.
Lekkas, Panos. Network Processors: Architectures, Protocols and Platforms. 1st edition. New York: McGraw-Hill Education, 2003.
Meiners, Chad R., Alex X. Liu, and Eric Torng. Hardware Based Packet Classification for High Speed Internet Routers. 2010 edition. New York: Springer, 2010.
Noubir, Guevara. “Signal Encoding Techniques.” Accessed April 22, 2017. http://www.ccs.neu.edu/home/noubir/Courses/CS6710/S12/slides/signals-encoding.pdf.
Stringfield, Nakia, Russ White, and Stacia McKee. Cisco Express Forwarding. 1st edition. Indianapolis, IN: Cisco Press, 2007.
Thakur, Dinesh. “Encoding Techniques and Codec.” Computer Notes. Accessed April 22, 2017. http://ecomputernotes.com/computernetworkingnotes/communication-networks/encoding-techniques-and-codec.
“Understanding IEEE 802.3ad Link Aggregation—Technical Documentation—Support—Juniper Networks,” March 26, 2013. https://www.juniper.net/documentation/en_US/junose14.2/topics/concept/802.3ad-link-aggregation-understanding.html.
1. What happens if one end of a link is configured as a bundle and the other end is not? Specifically, what happens if one device thinks STP is running and the other does not?
2. Why is flow hashing typically used as opposed to round-robin as a forwarding algorithm in ECMP?
3. What is the purpose of a multistage fabric? Provide an example.
4. Briefly summarize techniques found in crossbar fabrics to mitigate contention.
5. The iSLIP algorithm has steps of Request, Grant, and Accept. In a single sentence for each, explain what happens in each step.
6. How many times does iSLIP need to run before it is no longer effective in improving input-to-output matches?
7. How many packets can be placed on the transmit ring at a time?
8. Why not make the transmit and receive rings large enough to prevent any packet from ever being overwritten because the packets being held in the ring buffer are not processed quickly enough? What are the tradeoffs in terms of switching speed through the switch, memory utilization, and other factors?
9. Research and describe the impact of a broadcast storm in a network. How does routing prevent broadcast storms?
10. What are some advantages of using MLAG to build very large, flat networks without routing? What are some disadvantages?
On an ordinary day, the highway was wide enough to accommodate travelers. There were enough lanes. The speed limit was set to move traffic through the area quickly. The volume of cars was not excessive. Vehicles on this highway moved along effectively, moving down the road without having to jostle for position, stand on the brakes, weave in between lanes, or otherwise negotiate excessive traffic. That is, on an ordinary day.
This was not an ordinary day. On this day, the president was coming to town. The president was making a speech, and many people wanted to hear this speech. As the hour got closer to the president’s speech, the ordinarily effective highway saw an increase in traffic. At first, this was not a concern. The highway rarely ran at capacity, and so an increase in traffic was manageable. Granted, there were more vehicles on the road, and they were running closer together. But this didn’t cause any problems.
As the day wore on, and the time for the president’s speech became quite close, the traffic had increased yet again. Now, there were problems. The highway was no longer able to carry the volume of traffic trying to run across it. Vehicles merging onto the highway found themselves stuck in lines at the on-ramp. Other vehicles were trapped on the highway, moving, albeit very slowly. Some vehicles gave up on using the highway, turning around and heading back home, hoping to catch the president’s speech on television or via live stream.
The president’s cavalcade of vehicles drove from the regional airport to the site of the speech. Their vehicles, too, were impacted by the congested highway. However, the presidential parade had more of something than any other vehicles on the road had—importance. To indicate their importance, they put on their emergency lights. Police escort vehicles, presidential protection detail, limousines, and threat response trucks all lit up in flashing red and blue.
The struggling highway traffic moved aside as the president’s vehicles surged forward, heading down a priority lane to the site of the speech. Not everyone was going to make it to the speech, but the president couldn’t be victimized by the traffic. No matter how overloaded the highway was, the president had to get through. The president was the one making the speech.
Network engineers frequently face the problem of too much traffic for too small of a link. In particular, in almost every path through a network, one link restricts the entire path, much as one intersection or one road restricts the flow of traffic. Figure 8-1 illustrates.
In Figure 8-1, A is communicating with G, and B is communicating with E. If each of these pairs of devices is using close to the available bandwidth on its local links ([A,C], [B,C], [F,G], and [D,E]), assuming all the links are the same speed, the [C,D] link will be overwhelmed with traffic, becoming a choke point in the network.
When a link is congested, such as the [C,D] link in Figure 8-1, there is more traffic to be sent down the link than the link has capacity to carry. During times of congestion, a network device such as a router or switch must determine which traffic should be forwarded, which should be dropped, and in what order packets should be forwarded. Various prioritization schemes have been constructed to address this challenge.
Managing link congestion by prioritizing some traffic classes over others comes under the broad heading of Quality of Service (QoS). The perception of QoS among network engineers is troubled for many reasons. For instance, many implementations, even recent ones, tend to be not as well thought out as they could be, especially in the way they are configured and maintained. Further, early schemes did not always work well, and QoS can often add to the problems in a network, rather than relieve them, and tends to be very difficult to troubleshoot.
For these reasons, and because the configuration required to implement prioritization schemes tends toward the arcane, QoS is often considered a dark art. To successfully implement a QoS strategy, you must classify traffic, define a queueing strategy for various traffic classes, and install the strategy consistently across all network devices that might experience link congestion.
While it is possible to become buried in the many different features and functions of QoS schemes and implementations, the result should always be the same. The president must deliver a timely speech.
After thinking through the value proposition of QoS, an obvious reaction is to wonder why network engineers don’t simply size links large enough to avoid congestion. After all, if links were large enough, congestion would disappear. If congestion disappeared, then the need to prioritize one traffic type over another would disappear. All traffic would be delivered, and all of these pesky problems rooted in insufficient bandwidth would be obviated. Indeed, overprovisioning is perhaps the best QoS of all.
Sadly, the overprovisioning strategy is not always an available option. Even if it were, the very largest links available can’t overcome certain traffic patterns. Some applications will use as much bandwidth as available when transferring data, creating a point of congestion for other applications sharing the link. Others will transmit in micro-bursts, overwhelming network resources for a short time, and some transport mechanisms—such as the Transmission Control Protocol (TCP)—will intentionally congest a path occasionally in order to determine the best rate at which to send data. While a larger link can reduce the amount of time a congestion condition exists, in certain scenarios, there is no such thing as having enough bandwidth to meet all demands.
Most networks are built on a model of oversubscription, where some larger amount of aggregated bandwidth is shared at certain bottlenecks. For example, a Top of Rack (ToR) switch in a busy data center might have 48x10GbE ports facing hosts, but only have 4x40GbE ports facing the rest of the data center. This results in an oversubscription ratio of 480:160, which reduces to 3:1. Implicitly, the 160Gbps of data center facing bandwidth is a potential bottleneck—a congestion point—for the 480Gbps of host facing bandwidth. And yet, a 3:1 oversubscription ratio is common in data center switching designs. Why?
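The arithmetic behind the ratio in the ToR example is worth making explicit; this small helper is illustrative only.

```python
from math import gcd

def oversubscription(edge_ports, edge_gbps, uplink_ports, uplink_gbps):
    """Reduce edge-facing vs. fabric-facing bandwidth to a simple ratio."""
    edge = edge_ports * edge_gbps        # total host-facing bandwidth
    uplink = uplink_ports * uplink_gbps  # total fabric-facing bandwidth
    g = gcd(edge, uplink)
    return edge // g, uplink // g

# The ToR example from the text: 48x10GbE down, 4x40GbE up.
print(oversubscription(48, 10, 4, 40))  # -> (3, 1)
```

The same helper shows why adding a fifth and sixth 40GbE uplink only improves the ratio to 2:1; reaching 1:1 requires matching the full 480Gbps of edge bandwidth.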
The ultimate answer is often money. It is often possible to design a network in which the edge ports match the available bandwidth. For instance, in the data center fabric given above, it is almost certainly possible to add enough link capacity to provide 480Gbps from the ToR into the fabric, but the cost may well be prohibitive. The network engineer needs to consider not only the costs of the port and fiber optics, but also the cost of additional power, and the cost of the additional cooling required to control the environment once the necessary additional devices have been added, and even the costs of additional rack space and floor weight.
Spending money to provide a higher fabric bandwidth may also be hard to justify if the network or fabric is rarely congested. Some congestion events are not frequent enough to justify an expensive network upgrade. Would a city spend millions or billions of dollars in transportation infrastructure improvements to ease traffic once a year when a politician comes to visit? No. Instead, other adjustments are made to handle the traffic problem.
For example, companies might most keenly experience this constraint in wide area networking, where links are leased from service providers (SPs). SPs make their money, in part, by connecting disparate geographies together for organizations that cannot afford to build out and operate long-distance fiber-optic cables on their own. These long-haul links normally offer much lower bandwidth than the shorter, local, links on a single campus, or even within a single building. A high-speed link within a campus or data center can easily overwhelm slower long-haul links.
Organizations will size long-haul (such as intersite, or even intercontinental) links as large as reasonably possible, but again, the key to keep in mind is money. Long-haul links provided by SPs are a costly, usually significant, and often-scrutinized budget item. The more bandwidth being leased, the higher the costs tend to be. The result is massive oversubscription, where WAN links are greatly bandwidth constrained when compared to the speeds available on a campus or inside a data center.
In a world of oversubscription and consequent congestion points, as well as temporary traffic patterns that need careful management, QoS traffic prioritization schemes will always be required.
QoS prioritization schemes act on different traffic classes, but what is a traffic class, and how is it defined?
Traffic classes represent aggregated groups of traffic. Data streams from applications requiring similar handling or presenting similar traffic patterns to the network are placed into groups and managed by a QoS policy (or Class of Service, CoS). This grouping is crucial, as it would be ponderous to define unique QoS policies for a potentially infinite number of applications. As a matter of practicality, network engineers will typically group traffic into four classes. More classes are certainly possible, and such schemes do exist in production networks. However, the management of the classification system and policy actions becomes increasingly tedious as the number of classes grows beyond four.
It is possible for each packet to be assigned to a particular CoS based on the source address, destination address, source port, destination port, size of the packet, and other factors. Assuming each application has its own profile, or set of characteristics, each application can be placed into a specific CoS and acted on by the local QoS policy. The problem with this method of traffic classification is that the classification is only locally significant: the classification action is relevant only to the device performing the classification.
Classifying packets in this way takes significant time, and processing each packet requires significant processing power. Because of this, it is best not to repeat this processing at every device through which the packet passes. Instead, it is better to classify the traffic once, mark the packet at this single point, and act on the marking at every subsequent hop in the network.
Note
Even though packets and frames are distinct in networking, the term packets will be used in this chapter.
Various marking schemes have been designed and standardized, such as the 8-bit Type of Service (ToS) field included in the Internet Protocol version 4 (IPv4) header. Version 6 of this same protocol (IPv6) includes an 8-bit Traffic Class field serving a similar purpose. Ethernet frames use a 3-bit field as part of the 802.1p specification. Figure 8-2 illustrates the IPv4 ToS field.
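In current practice, the upper six bits of this byte carry the DSCP value and the remaining two bits are used for Explicit Congestion Notification (ECN). A couple of bit shifts recover both fields; a minimal sketch:

```python
def split_tos(tos_byte):
    """Split the 8-bit ToS/Traffic Class byte into its modern fields:
    the upper six bits are the DSCP, the lower two bits are ECN."""
    dscp = tos_byte >> 2
    ecn = tos_byte & 0b11
    return dscp, ecn

# EF (DSCP 46) with no ECN bits set occupies the ToS byte as 0b10111000.
assert split_tos(0b10111000) == (46, 0)
```

Going the other direction, a DSCP value is placed in the byte as `dscp << 2`, which is why documentation sometimes quotes the same code point as two different numbers (46 as a DSCP, 184 as a ToS byte).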
In networking best practice, traffic classification should result in one action and one action only—marking. When a packet has been marked, the assigned value can be preserved and acted upon throughout the packet’s entire journey through the network path. Classification and subsequent marking should be a “one-and-done” event in the life of a packet.
QoS best practice is to mark traffic as closely to the source as possible. Ideally, traffic will be marked at the point of ingress to the network. For example, traffic flowing into a network switch from a personal computer, phone, server, IoT device, etc. will be marked, and the mark will serve as the traffic classifier on the packet’s journey through the network.
An alternate scheme to the ingress network device classifying and marking traffic is for the application itself to mark its own traffic. In other words, the packet is sent out with the ToS byte already populated. This brings up the problem of trust. Should an application be allowed to rank its own importance? In the worst-case scenario, all applications would selfishly mark their packets with values indicating the highest possible importance. If every packet is marked as being highly important, then in actuality, no packet is highly important. For one packet to be more important than any other, there must be differentiation. Traffic classes must have distinct levels of importance for QoS prioritization schemes to have any meaning.
To maintain control over traffic classification, all networks implementing QoS have trust boundaries. Trust boundaries allow the network to avoid a situation where all applications have marked themselves as important. Imagine what would happen on a congested road if every vehicle had flashing emergency lights—the truly important vehicles would not stand out.
In networking, some applications and devices are trusted to mark their own traffic. For example, IP phones are typically trusted to mark their streaming voice and control protocol traffic appropriately, meaning the marks that IP phones apply to their traffic are accepted by the ingress network device. Other endpoints or applications might be untrusted, meaning the packet’s ToS byte is erased or rewritten on ingress. By default, most network switches erase the marks sent to them unless configured to trust specific devices. For instance, markings placed in a packet by a server are often trusted, while markings set by an end host are not. Figure 8-3 illustrates a trust boundary.
In Figure 8-3, packets being transmitted by B are marked with AF41. As these packets are originating from a host within the QoS trust domain, the markings remain as they pass through D. Packets originating from A are marked with EF; however, since A is outside the QoS trust domain, this marking is stripped at D. Packets within the trust domain originating at A are seen as unmarked from a QoS perspective. The physical layer and upper layer protocol markings may, or may not, be related. For instance, the upper layer markings may be copied into the lower layer markings, or the lower layer markings may be carried through the network, or the lower layer markings may be stripped. There are many different possible implementations, so you should be careful to understand the way the markings are being handled across layers, as well as at each hop.
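The ingress trust decision itself is simple to express. This sketch assumes a per-port trust flag and a best-effort default; both are illustrative, not any particular switch's configuration model.

```python
BEST_EFFORT = 0  # DSCP value untrusted marks are rewritten to

def ingress_mark(port_trusted, incoming_dscp):
    """Apply the trust boundary: keep marks arriving on trusted ports,
    remark everything else to best effort."""
    return incoming_dscp if port_trusted else BEST_EFFORT
```

In the terms of Figure 8-3, B's AF41 marking (inside the trust domain) survives the pass through D, while A's claim of EF is stripped, so A's packets continue through the trust domain unmarked.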
Although network operators can use any values they choose in the ToS byte to create distinct traffic classes, it is often best to stick with some standard, such as the values defined by IETF RFC standards. These standards were defined to give network engineers a logical scheme to appropriately distinguish many different traffic classes.
Two of these “Per Hop Behavior” schemes appear in RFC2597, Assured Forwarding (AF), and RFC3246, Expedited Forwarding (EF), with various other RFCs updating or clarifying the content of these foundational documents. Both of these RFCs define traffic marking schemes, including the exact bit values that should populate the ToS byte or Traffic Class byte of an IP header to indicate a specific type of traffic. These are known as Differentiated Service Code Points, or DSCP values.
For example, RFC2597’s assured forwarding scheme defines 12 values in a bitwise hierarchical scheme to populate the eight bits found in the ToS byte field. The first three bits are used to identify a class while the second three bits identify a drop precedence. The final two bits are unused. Table 8-1 illustrates the code markings for several AF classes.
Table 8-1 Assured Forwarding Class of Service Quality of Service Markings

|             | Class 1 (001)  | Class 2 (010)  | Class 3 (011)  | Class 4 (100)  |
|-------------|----------------|----------------|----------------|----------------|
| Low Drop    | AF11 (001 010) | AF21 (010 010) | AF31 (011 010) | AF41 (100 010) |
| Medium Drop | AF12 (001 100) | AF22 (010 100) | AF32 (011 100) | AF42 (100 100) |
| High Drop   | AF13 (001 110) | AF23 (010 110) | AF33 (011 110) | AF43 (100 110) |
Table 8-1 shows the DSCP bit value for AF11, traffic of Class 1 with a low drop precedence, is 001 010, where “001” indicates Class 1, and “010” indicates the drop precedence. Examining the table more deeply reveals the binary pattern selected by the RFC authors. All Class 1 traffic is marked with 001 in the first three bits, all Class 2 with 010 in the first three bits, etc. All Low Drop Precedence traffic is marked with 010 in the second three bits, all Medium Drop Precedence traffic with 100 in the second three bits, etc.
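This bitwise pattern means every AF code point can be computed rather than memorized: the class occupies the high three bits and the drop precedence the next two, so AFxy works out numerically to 8x + 2y. A quick check against Table 8-1:

```python
def af(cls, drop):
    """Assured Forwarding code point AFxy: class x in the first three
    bits, drop precedence y shifted into the next bits, i.e. 8x + 2y."""
    return (cls << 3) | (drop << 1)

# Matches Table 8-1: AF11 = 001 010, AF22 = 010 100, AF43 = 100 110.
assert af(1, 1) == 0b001010
assert af(2, 2) == 0b010100
assert af(4, 3) == 0b100110
```

The same formula explains the decimal values commonly quoted in configurations, such as AF11 = 10 and AF41 = 34.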
The Assured Forwarding scheme shown in Table 8-1 is meant to illustrate; it is not a definitive list of code points used in QoS traffic classification. For example, the Class Selector scheme described in RFC2474 exists for backward compatibility with the IP Precedence marking scheme. IP Precedence used only the first three bits of the ToS byte, for a total of eight possible classes. The Class Selector uses eight values as well, populating the first three bits of the six-bit DSCP field with significant values (matching the legacy IP Precedence scheme) and the last three bits with zeros. Table 8-2 shows these class selectors.
Table 8-2 Class Selectors from RFC2474
| Class Selector | DSCP Bits |
| CS0 | 000 000 |
| CS1 | 001 000 |
| CS2 | 010 000 |
| CS3 | 011 000 |
| CS4 | 100 000 |
| CS5 | 101 000 |
| CS6 | 110 000 |
| CS7 | 111 000 |
RFC3246 defines the latency, loss, and jitter requirements of traffic that must be forwarded expeditiously, along with a single new code point—EF, which is assigned binary value 101 110 (decimal 46).
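These bit patterns are easy to verify programmatically. A short sketch deriving each class selector (the class number shifted into the first three bits) and confirming the decimal value of EF:

```python
# RFC2474 class selectors: CSn places n in the first three bits of the
# six-bit DSCP field; the remaining three bits are zero.
class_selectors = {f"CS{n}": n << 3 for n in range(8)}

for name, value in class_selectors.items():
    print(f"{name} = {value:06b} (decimal {value})")

# RFC3246 Expedited Forwarding code point.
EF = 0b101110
print(f"EF  = {EF:06b} (decimal {EF})")  # decimal 46
```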
The quantity and variety of formally defined DSCP values might seem overwhelming. The combined definitions of AF, CS, and EF alone result in formal definitions for 21 different classes out of a possible 64 using the six bits of the DSCP field. Are network engineers expected to use all of these values in their QoS prioritization schemes? Should traffic be broken down with such fine granularity for effective QoS?
In practice, most QoS schemes limit themselves to between four and eight traffic classes. The different classes allow for each group to be treated uniquely during times of congestion. For example, one traffic class might be shaped to meet a specific bandwidth threshold. Another traffic class might be prioritized above all other traffic. Yet another might be defined as business-critical, or traffic that is more important than most but less important than some. Network protocol traffic critical for infrastructure stability could be treated as very high priority. A scavenger traffic class might be near the bottom of the priority list, receiving slightly more attention than unmarked traffic.
A scheme incorporating these values is likely to be a mix of code points defined in the various RFCs and could vary somewhat from organization to organization. Generally accepted values include EF for critical traffic with a timeliness requirement such as VoIP, and CS6 for network control traffic such as routing and first hop redundancy protocols. Unmarked traffic (i.e., a DSCP value of 0) is delivered on a best-effort basis, with no guarantee of service level made (this would generally be considered the scavenger class, as above).
An interesting problem mentioned in both RFC2597 and RFC3246 is the issue of mark preservation when a marked packet is tunneled. When a packet is tunneled, the original packet is wrapped—or encapsulated—inside of a new IP packet. The ToS byte value is inside the IP header of the now-encapsulated packet. Uh oh. What just happened to the carefully crafted traffic classification scheme? The answer is network devices engage in ToS reflection when tunneling. Figure 8-4 shows the reflection process.
When a packet is tunneled, the ToS byte value in the encapsulated packet is copied (or reflected) in the IP header of the tunnel packet. This preserves the traffic classification of the tunneled application.
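ToS reflection can be illustrated with a toy model. The `Packet` class and its field names below are invented for illustration; they are not a real networking API:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    src: str
    dst: str
    tos: int          # ToS byte, carrying the DSCP mark
    payload: object   # application data, or an encapsulated Packet

def tunnel_encapsulate(inner: Packet, tunnel_src: str, tunnel_dst: str) -> Packet:
    """Wrap `inner` in a new outer header, reflecting its ToS byte so
    the mark stays visible to devices along the tunnel path."""
    return Packet(src=tunnel_src, dst=tunnel_dst,
                  tos=inner.tos,   # ToS reflection: copy the inner mark out
                  payload=inner)

voice = Packet("10.1.1.5", "10.2.2.9", tos=46, payload=b"rtp")
outer = tunnel_encapsulate(voice, "198.51.100.1", "203.0.113.1")
print(outer.tos)  # 46 -> the EF mark survives encapsulation
```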
A similar challenge comes when sending marked traffic from a network domain you control into one you do not. The most common example is sending marked traffic from your local area network into the network of your service provider, traversing its wide area network. Service providers, as a part of the contract to provide connectivity, often provide differentiated levels of service as well. However, for them to be able to provide differentiated service, traffic must be marked in a way they can recognize. Their marking scheme is unlikely to be the same as your marking scheme, considering the sheer number of possible marking schemes.
A couple of solutions to this dilemma present themselves:
• DSCP mutation: In this scenario, the network device on the border between the LAN and the WAN translates the mark from the original value assigned on the LAN into a new value the SP will honor. The translation is performed in accordance with a table configured by a network engineer.
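A DSCP mutation table is conceptually just a lookup applied at the border device. A minimal sketch follows; the specific value pairs in the map are hypothetical examples, not values from the text or from any provider's contract:

```python
# Hypothetical mutation map: LAN mark -> mark the SP honors.
# Unlisted values fall back to 0 (best effort).
MUTATION_MAP = {
    46: 46,   # EF passes through unchanged
    34: 26,   # AF41 on the LAN becomes AF31 on the SP WAN
    10: 0,    # AF11 is demoted to best effort
}

def mutate_dscp(lan_dscp: int) -> int:
    """Translate an egress packet's DSCP mark at the LAN/WAN border."""
    return MUTATION_MAP.get(lan_dscp, 0)
```

In practice this table lives in the border router's configuration, keyed however the vendor's QoS feature set expresses it.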
• DSCP translation: It is not uncommon for SPs to observe only the first three bits of the ToS byte, hearkening back to the days of IP Precedence defined all the way back in RFC791.
In the second solution, the network engineer is faced with creating a modern DSCP marking scheme using six bits, even though the SP will pay attention to just the first three. The challenge is to maintain differentiation. For example, consider the scheme illustrated in Table 8-3; this scheme will not resolve the issue.
Table 8-3 Translating DSCP to IP Precedence
| DSCP (LAN, 6 bits) | Precedence (SP WAN, 3 bits) |
| EF / 101 110 | 101 |
| CS5 / 101 000 | 101 |
| AF23 / 010 110 | 010 |
| AF13 / 001 110 | 001 |
| AF12 / 001 100 | 001 |
| AF11 / 001 010 | 001 |
In this table, six unique DSCP values have been defined for use on the local area network. However, these six unique values are reduced to only three unique values if only the first three bits are honored by the service provider. This means some traffic that might have enjoyed differentiated treatment before entering the provider's network will now be lumped into the same bucket. In the example, EF and CS5, formerly unique, fall into the same class when they leave the border router, as the first three bits of both EF and CS5 are 101. The same goes for AF11, AF12, and AF13: three formerly distinct traffic classes will now be treated identically while traversing the SP WAN, as they all share the same 001 value in the first three bits.
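This collapse is easy to demonstrate: keeping only the first three bits of a six-bit DSCP value is a right shift by three, and distinct six-bit marks can land on the same three-bit precedence. A small sketch:

```python
def to_precedence(dscp: int) -> int:
    """Keep only the first three bits of a six-bit DSCP value,
    as an SP honoring only IP Precedence would."""
    return dscp >> 3

EF, CS5 = 0b101110, 0b101000
AF11, AF12, AF13 = 0b001010, 0b001100, 0b001110

# EF and CS5 collapse into precedence 101 (decimal 5)...
print(to_precedence(EF), to_precedence(CS5))  # 5 5
# ...and AF11, AF12, AF13 all collapse into precedence 001 (decimal 1).
print(to_precedence(AF11), to_precedence(AF12), to_precedence(AF13))  # 1 1 1
```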
A way to solve this problem is to create a DSCP marking scheme that will maintain uniqueness in the first three bits, as demonstrated in Table 8-4. This might require a reduction in the overall number of traffic classes, however. Limiting the scheme to the first three bits to define classes will reduce the total number of classes to a maximum of six.
Table 8-4 Translating DSCP to IP Precedence
| DSCP (LAN, 6 bits) | Precedence (SP WAN, 3 bits) |
| EF / 101 110 | 101 |
| AF41 / 100 010 | 100 |
| AF31 / 011 010 | 011 |
| AF21 / 010 010 | 010 |
| AF11 / 001 010 | 001 |
| CS0 / 000 000 | 000 |
Table 8-4 shows a marking scheme using a mix of EF, AF, and Class Selector values especially chosen to preserve uniqueness in the first three bits.
So far, this discussion assumes network devices will honor the marks found in an IP packet. Certainly, this is true in privately owned networks and on leased networks where the terms of trust have been negotiated with a service provider. But what happens on the global Internet? Do network devices servicing public Internet traffic observe and honor DSCP values, and prioritize some traffic over other traffic during times of congestion? From the perspective of Internet consumers, the answer is no. The public Internet is a best effort transport. There are no guarantees of even traffic delivery, let alone traffic prioritization.
Even so, the global Internet is increasingly being used as a wide area transport for traffic carried between private facilities. Cheap broadband Internet service sometimes offers more bandwidth at a lower cost than private WAN circuits leased from a service provider. The tradeoff for this lower cost is a lower level of service, often substantially lower. Cheap Internet circuits are cheap because they do not offer service level guarantees, at least not ones meaningful enough to inspire confidence in the timely delivery of traffic (if at all). While it is possible to mark traffic destined for the Internet, the ISP will not pay attention to the marks. When the Internet is being used as a WAN transport, how then can a QoS policy be effectively applied to traffic?
Creating a Quality of Service over the public Internet requires a rethinking of QoS prioritization schemes. To the private network operator, the public Internet is a black box. The private operator has no control over the public routers between the edges of the private WAN. It is not possible for the private operator to prioritize certain traffic over other traffic on a congested public Internet link without control over the intermediate, public Internet routers.
The solution to providing Quality of Service over the public Internet is multipartite:
• Control over traffic happens at the private network edge, before the traffic enters the public Internet's black box. This is the last point at which the private network operator has device control.
• QoS policy is enforced primarily through path selection and secondarily via congestion management.
Note
See Chapter 17, "Policy in the Control Plane," for more information on using traffic engineering to manage QoS problems.
Implicit in the notion of path selection is the existence of more than one path to select from. In the emerging Software-Defined Wide Area Network (SD-WAN) model, two or more WAN circuits are treated as a bandwidth pool. In the pool, the individual circuit used to carry traffic at any given time is decided on a moment-by-moment basis as the network devices at the edge of the pool perform quality tests along each available circuit or path. Depending on a path's characteristics at any point in time, traffic may be sent down one path or another.
Which traffic is sent down which path? SD-WAN offers granular traffic classification capabilities beyond the human-manageable four to eight classes defined by DSCP marks imposed on the ToS byte. SD-WAN path selection policy can be defined on an application-by-application basis, with nuanced forwarding decisions made. This is distinct from the idea of marking as close to the source as possible, and then making forwarding decisions during congestion times based on the mark. Rather, SD-WAN compares real-time path characteristics with the policy-defined needs of applications classified in real time, and then makes a real-time path selection decision.
The result should be an application user experience similar to a wholly owned private WAN with a QoS prioritization scheme managing congestion. The mechanisms used to achieve this similar result are substantially different, however. The functionality of SD-WAN hinges on the ability to detect and quickly reroute traffic flows around a problem, as opposed to managing a congestion problem once it has happened. SD-WAN technologies do not replace QoS; rather they provide an "over the top" option for situations where QoS is not supported on the underlying network.
Classification, by itself, does not result in a specific forwarding posture on the part of a network device. Rather, classifying traffic is the first necessary step in creating a framework for differentiated forwarding behavior. In other words, the packets have been classified and differentiated, but that is all. Pointing out differences is not the same as taking differentiated actions on those classes.
Our discussion of QoS now moves into the realm of policy. How are congested interfaces managed? When packets are waiting for delivery, how does a network device decide which packets are sent first? The decision points are based primarily around how well the user experience can tolerate packet jitter, latency, and loss. A variety of problems and QoS tools present themselves to address these issues.
Network interfaces forward packets as quickly as possible. When traffic is flowing at less than or equal to the bandwidth of the egress interface, traffic is delivered, one packet at a time, without drama. When an interface can keep up with the demands being placed on it, there is no congestion. Without congestion, there is no concern about differentiated traffic types. The marks on the individual packets might be observed for statistical purposes, but there is no QoS policy that needs to be applied. Traffic arrives at the egress interface and is delivered.
As described in the discussion of the switching path through a router in Chapter 7, "Packet Switching," packets are delivered to a transmit ring after being switched. The outbound interface's physical processor removes packets from this ring and clocks them onto the physical network medium. What happens if there are more packets to be transmitted than the link can support? In this case, the packets are placed in a queue, the output queue, rather than on the transmit ring. The QoS policies configured on the router are actually implemented in the process of removing packets from the output queue onto the transmit ring for transmission. When packets are being placed on the output queue, rather than the transmit ring, the interface is said to be congested.
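The relationship between the transmit ring and the output queue can be modeled in a few lines. This is a toy sketch, not any vendor's actual implementation; the class and attribute names are invented for illustration:

```python
class EgressInterface:
    """Toy model: packets fill the transmit ring first; overflow is
    buffered in the output queue, at which point the interface is
    considered congested and QoS policy applies."""

    def __init__(self, ring_size: int):
        self.ring_size = ring_size
        self.tx_ring = []       # drained by the physical processor
        self.output_queue = []  # serviced according to the QoS policy

    def enqueue(self, packet) -> None:
        if len(self.tx_ring) < self.ring_size:
            self.tx_ring.append(packet)
        else:
            self.output_queue.append(packet)  # congestion begins here

    @property
    def congested(self) -> bool:
        return len(self.output_queue) > 0
```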
By default, congested network interfaces deliver packets on a first-in, first-out (FIFO) basis. FIFO does not make a policy decision based on differentiated traffic classes; rather FIFO simply services buffered packets in order, as quickly as the egress interface will allow. For many applications, FIFO is not a bad way to go about dequeueing packets. For instance, there might be little real-world impact if a Hypertext Transfer Protocol (HTTP, the protocol used to carry World Wide Web information) packet from one web server is transmitted before one from a different web server.
For other traffic classes, there is a great deal of concern about timeliness. As opposed to FIFO, some packets should be moved to the head of the queue and sent as quickly as possible to avoid delay and an impact to the end user experience. One impact is in the form of a packet arriving too late to be useful. Another impact is in the form of a packet not arriving at all. It is worth considering each of these scenarios and then some helpful QoS tools for each.
Voice over IP (VoIP) traffic must be both delivered and delivered on time. When considering voice traffic, think of any real-time voice chatting performed over the Internet using an application such as Skype. Most of the time, the call quality is decent. You can hear the other person. That person can hear you. The conversation flows normally. You might as well be in the same room with the other person, even if he is across the country.
On occasion, VoIP call quality might drop. You might hear a series of subsecond stutters in the person's voice, where the speed of vocal delivery is irregular. In this case, you are experiencing jitter, which means packets are not arriving consistently in time. Overly long interpacket gaps result in an audible stuttering effect. While no packets were lost, they weren't delivered along the network path in a timely fashion. Somewhere along the path, the packets were delayed long enough to introduce audible artifacts. Figure 8-5 illustrates jitter in packet transmission.
VoIP call quality can also suffer from packet loss, where packets in the network path are dropped along the way. While there are many potential reasons for packet loss in network paths, the scenario considered here is tail drop: so much traffic has arrived, beyond what the egress interface can keep up with, that there is no room left in the buffer to queue the excess. The latest traffic arrivals are discarded as a result, which is why this drop is called tail drop.
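Tail drop itself is simple to model: a bounded buffer accepts arrivals until it is full, then discards whatever arrives next. A sketch, with an arbitrary buffer size chosen for illustration:

```python
def buffer_arrivals(arrivals, buffer_size):
    """Queue packets until the buffer is full; tail-drop the rest."""
    queued, dropped = [], []
    for packet in arrivals:
        if len(queued) < buffer_size:
            queued.append(packet)
        else:
            dropped.append(packet)  # tail drop: newest arrivals discarded
    return queued, dropped

queued, dropped = buffer_arrivals([f"pkt{i}" for i in range(10)], buffer_size=6)
print(dropped)  # ['pkt6', 'pkt7', 'pkt8', 'pkt9']
```

Real devices also apply active queue management (dropping selectively before the buffer fills), but the hard cutoff above is the essence of tail drop.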
When VoIP traffic is being tail dropped, the listener hears the result of the loss. There are gaps where the speaker's voice is completely missing. Dropped packets could come through as silence, as the last bit of received sound looped to fill the gap, as an extended hiss, or as other digital noise. Figure 8-6 illustrates dropped packets across a router or switch.
To deliver consistent call quality, even in the face of a congested network path, a QoS prioritization scheme must be applied. This scheme must meet the following criteria.
• VoIP traffic must be delivered: Lost VoIP packets result in an audible drop in the conversation.
• VoIP traffic must be delivered on time: Delayed or jittery VoIP packets result in audible stutters.
• VoIP traffic must not starve other traffic classes of bandwidth: As important as VoIP is, well-written QoS policies will balance timely delivery of voice packets with the need for other traffic classes to also use the link.
A common scheme deployed to prioritize traffic sensitive to loss and jitter is low-latency queueing (LLQ). No IETF RFC defines LLQ; rather, network equipment vendors invented LLQ as a tool in the QoS policy toolbox to prioritize traffic requiring low delay, jitter, and loss—such as voice.
There are two key elements to LLQ.
• Traffic serviced by LLQ is transmitted as quickly as possible to avoid delay, minimizing jitter.
• Traffic serviced by LLQ is not allowed to exceed a specified amount of bandwidth (generally recommended to be no more than 30% of the available bandwidth). Traffic exceeding the bandwidth limit is dropped rather than transmitted. This technique avoids starving other traffic classes.
Implied in this scheme is a compromise for traffic classes serviced by the LLQ. The traffic will be serviced as quickly as possible, effectively moving it to the head of the queue as soon as it shows up at a congested interface. The catch is that there is a limit on just how much traffic in this class will be treated in this way. That limit is imposed by the network engineer composing the QoS policy.
By way of illustration, assume a WAN link with 1,024Kbps of available bandwidth. This link connects the headquarters office to the service provider WAN cloud, which also connects several remote offices back to HQ. This is a busy WAN link, carrying VoIP traffic between offices, as well as web application traffic and backup traffic from time to time. Furthermore, assume the VoIP system is encoding voice traffic with a codec requiring 64Kbps per conversation.
In theory, this 1,024Kbps link could accommodate 16 × 64Kbps simultaneous VoIP conversations. However, this would leave no room for the other traffic types that are present. This is a busy WAN link! In the writing of the QoS policy, a decision must be made. Just how many voice conversations will be allowed by the LLQ to avoid starving the remaining traffic of bandwidth? A choice could be made to limit the LLQ to only 512Kbps of bandwidth, which would be adequate to handle eight simultaneous conversations, leaving the rest of the WAN link for other traffic classes.
Assuming the link is congested (the situation the link must be in for the QoS policy to take effect), what would happen to the ninth VoIP conversation? This question is actually a naive one, because it assumes each conversation is handled separately by the QoS policy. In fact, the QoS policy treats all traffic serviced by the LLQ as one large group of packets. Once the ninth VoIP conversation joins, there will be 576Kbps' worth of traffic to be serviced by an LLQ that has only 512Kbps allocated. To find the amount of dropped traffic, subtract the bandwidth set aside for the LLQ from the total traffic offered: 576Kbps – 512Kbps = 64Kbps' worth of LLQ traffic will be dropped to conform to the bandwidth cap. The dropped 64Kbps will come from the LLQ traffic class as a whole, impacting all of the VoIP conversations. If a tenth, eleventh, and twelfth VoIP conversation were to join the LLQ, the problem would become more severe: 64Kbps × 4 = 256Kbps' worth of nonconforming traffic would be discarded from the LLQ, causing even more loss across all of the VoIP conversations.
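The arithmetic in this example generalizes easily. Using the 64Kbps codec and the 512Kbps LLQ cap from the text, a quick sketch of how much LLQ traffic is dropped on a congested link:

```python
LLQ_CAP_KBPS = 512   # bandwidth allocated to the LLQ by the policy
CODEC_KBPS = 64      # per-conversation rate of the voice codec

def llq_dropped_kbps(conversations: int) -> int:
    """Bandwidth dropped from the LLQ class on a congested link."""
    offered = conversations * CODEC_KBPS
    return max(0, offered - LLQ_CAP_KBPS)

print(llq_dropped_kbps(8))   # 0   -> eight calls fit the cap exactly
print(llq_dropped_kbps(9))   # 64  -> the class as a whole loses 64Kbps
print(llq_dropped_kbps(12))  # 256 -> loss spread across all conversations
```

Note the drop is computed for the class as a whole, mirroring the point above: the policy does not single out the "ninth" call; every conversation in the LLQ shares the pain.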
As this example shows, managing congestion requires knowledge of the application mix, peak load times, bandwidth demands, and network architecture options available. Only when all points are considered can a solution meeting business objectives be put in place. For instance, assume 1,024Kbps is the largest you can make the long-haul link due to cost constraints. You could raise the LLQ bandwidth limitation to 768Kbps to accommodate 12 conversations at 64Kbps each. However, this would leave only 256Kbps for other traffic, which perhaps is not enough to meet your business needs for other applications.
In this case, it might be possible to coordinate with the voice system administrator to use a voice codec requiring less bandwidth. If a new codec requiring only 16Kbps of bandwidth per call is deployed instead of the original 64Kbps, 32 VoIP conversations could be forwarded without loss through an LLQ allocated 512Kbps of bandwidth. The compromise? Voice quality. The human voice encoded at 64Kbps will sound more clear and natural when compared to one encoded at 16Kbps. It may also be better to encode at 16Kbps so fewer packets are dropped, and hence the overall quality is better. Which solution to apply will depend on the specific situation.
It is possible for more traffic than specified by the LLQ bandwidth cap to pass through the interface. If the bandwidth cap for traffic serviced by the LLQ is set at a maximum of 512Kbps, it is possible for more than 512Kbps' worth of traffic in the class to pass through the interface. This programmed behavior exhibits itself only if the interface is uncongested. In the original example, where a 64Kbps codec is being used, transmitting 10 conversations at 64Kbps over the link will result in 640Kbps' worth of voice traffic traversing the 1,024Kbps capacity link (1,024Kbps – 640Kbps = 384Kbps left). As long as all other traffic classes stay below 384Kbps total bandwidth utilization, then the link will remain congestion-free. If the link is not congested, then QoS policies do not take effect. If the QoS policy is not in effect, then the LLQ bandwidth cap of 512Kbps does not impact the 640Kbps of aggregated voice traffic.
In this discussion of LLQ, the context has been that of voice traffic, but be aware that LLQ can be applied to any sort of traffic desired. However, in networks where VoIP is present, VoIP tends to be the only traffic serviced by LLQ. For networks where VoIP traffic is not present, LLQ becomes an interesting tool to guarantee timely delivery, with low delay and jitter, of other sorts of application traffic. However, LLQ is not the only tool available to the QoS policy writer. Several other tools are also useful.
When timing is of less concern than actual delivery, traffic can often be managed by the technique of class-based weighted fair queueing (CBWFQ). In CBWFQ, participating traffic classes are serviced in accordance with the policy assigned to them. For example, traffic marked as AF41 might be guaranteed a minimum amount of bandwidth. Traffic marked as AF21 might also be guaranteed a minimum amount of bandwidth, perhaps less than the amount given to AF41 traffic. Unmarked traffic might get whatever bandwidth is left over.
CBWFQ has the notion of fairness, where various traffic classes have a chance to be delivered across the congested link. CBWFQ ensures the packets in the queue are being serviced in a fair manner, in accordance with the QoS policy. All traffic classes with bandwidth assigned to them will have packets sent along.
For example, assume a link of 1,024Kbps in capacity. Traffic class AF41 has been guaranteed a minimum of 256Kbps. Class AF31 has been guaranteed a minimum of 128Kbps. Class AF21 has been guaranteed a minimum of 128Kbps. This gives us a ratio of 2:1:1 among those three classes. The remaining 512Kbps is unallocated, meaning it is available for use by other traffic. Including the unallocated amount, the full ratio is 256:128:128:512, which reduces to 2:1:1:4.
To decide which packet is sent next, the queue is serviced in accordance with the CBWFQ policy. This example carves up the 1,024Kbps of bandwidth into four portions, with a ratio of 2:1:1:4. For simplicity's sake, assume the congested interface will service the packets in the queue in eight clock cycles:
1. Clock cycle 1. An AF41 packet will be sent.
2. Clock cycle 2. Another AF41 packet will be sent.
3. Clock cycle 3. An AF31 packet will be sent.
4. Clock cycle 4. An AF21 packet will be sent.
5. Clock cycles 5–8. Packets with other classifications as well as unclassified packets will be sent.
This example assumes there are packets representing each of the four classes sitting in the buffer, queued to be sent. However, the situation is not always so straightforward. What happens when there are no packets from a particular traffic class to be sent, even though there is room in the guaranteed minimum bandwidth allocation?
Guaranteed bandwidth minimums are not reservations. If the traffic class assigned the guaranteed minimum does not require the full allocation, other traffic classes could use the bandwidth. Neither are guaranteed bandwidth minimums hard limits. If the amount of traffic for a specific class exceeds the guaranteed minimum and bandwidth is available, traffic for the class will flow at a faster rate.
Thus, what happens could look more like this:
1. Clock cycle 1. An AF41 packet is sent.
2. Clock cycle 2. There is no AF41 packet to be sent, so an AF31 packet is sent instead.
3. Clock cycle 3. Another AF31 packet is sent.
4. Clock cycle 4. There is no AF21 packet to be sent, so an unclassified packet is sent.
5. Clock cycles 5–7. Packets with other classifications as well as unclassified packets are sent.
6. Clock cycle 8. There are no more otherwise classified or unclassified packets to be sent, so yet another AF31 packet is sent.
As a result, unused bandwidth is divided up among the classes with excess traffic.
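The work-conserving behavior, where leftover cycles are handed to classes that still have traffic, can be sketched by extending the same round robin. The queue contents here are hypothetical, chosen so the AF41 and AF21 classes run dry mid-cycle:

```python
from collections import deque

weights = {"AF41": 2, "AF31": 1, "AF21": 1, "default": 4}
queues = {
    "AF41": deque(["af41-1"]),                    # only one AF41 packet waiting
    "AF31": deque(["af31-1", "af31-2", "af31-3"]),
    "AF21": deque(),                              # nothing marked AF21 right now
    "default": deque(["d-1", "d-2", "d-3"]),
}

sent = []
cycles = 8
# First pass: honor the guaranteed minimums for classes that have traffic.
for cls, weight in weights.items():
    for _ in range(weight):
        if queues[cls] and len(sent) < cycles:
            sent.append(queues[cls].popleft())
# Work-conserving pass: leftover cycles go to whichever classes still
# have packets queued -- the guaranteed minimum is not a hard limit.
while len(sent) < cycles and any(queues.values()):
    for cls in weights:
        if queues[cls] and len(sent) < cycles:
            sent.append(queues[cls].popleft())

print(sent)
```

The excess AF31 packets are sent in the slots AF41 and AF21 left unused, matching the second clock-cycle walkthrough above.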
CBWFQ does not increase throughput of a congested link. Rather, the algorithm is about carefully controlled sharing of the overstressed link in a way reflecting the relative importance of various traffic classes. The result of CBWFQ sharing is traffic being delivered via the congested link, but at a reduced rate when compared to the same link at an uncongested time.
The distinction between “sharing an overstressed link” and “creating bandwidth from nothing” cannot be overstated. A common misconception about QoS is, despite points of congestion in a network path, user experience will remain identical. This just is not the case. QoS tools like CBWFQ are, for the most part, about making the best of a bad situation. In picking which traffic is forwarded when, QoS is also choosing which traffic to drop; there are “winners” and “losers” among the flows transmitted across the network.
LLQ is a notable exception because traffic serviced by an LLQ is assumed to be so absolutely critical that it will be serviced to the exclusion of other traffic, up to the bandwidth limitation assigned. LLQ seeks to preserve user experience.
Traffic shaping is a way to gracefully cap traffic classes to a specific rate. For example, traffic marked as AF21 might be shaped to 512Kbps. Shaping is graceful; it allows for nominal bursts above the defined limit before dropping packets. This allows TCP to adjust more easily to the required rate. When the throughput of a shaped traffic class is graphed, the result shows a ramp-up to the speed limit, and then a flat, consistent transfer speed for the duration of the flow. Traffic shaping is most often applied to traffic classes populated by elephant flows.
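A token bucket is one common way to implement shaping. The sketch below assumes the 512Kbps shaped rate from the example and an invented 64Kbit burst allowance; a real shaper would queue the "delay" case for later transmission rather than return a string:

```python
# Hypothetical token-bucket shaper sketch. The rate matches the AF21
# example above; the bucket depth (burst allowance) is an invented value.
RATE_BPS = 512_000          # shaped rate: 512Kbps
BURST_BITS = 64_000         # bucket depth: nominal burst above the rate

tokens = BURST_BITS

def shape(packet_bits, elapsed_s):
    """Return 'send' if accumulated tokens cover the packet, else 'delay'
    (shaping queues excess traffic instead of dropping it outright)."""
    global tokens
    tokens = min(BURST_BITS, tokens + RATE_BPS * elapsed_s)
    if packet_bits <= tokens:
        tokens -= packet_bits
        return "send"
    return "delay"

results = [
    shape(12_000, 0.0),     # burst credit available: send immediately
    shape(60_000, 0.0),     # bucket nearly empty: delay, not drop
    shape(60_000, 0.125),   # 64Kbits replenished after 125ms: send
]
print(results)
```

Because excess packets are delayed rather than dropped, TCP sees a steady rate and the throughput graph flattens at the cap, as described above.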
Elephant flows are long-lived traffic flows used to move large amounts of data between two endpoints as quickly as possible. Elephant flows have the ability to fill network bottlenecks with their own traffic, squashing smaller flows. A common QoS strategy is to shape the traffic rate of elephant flows so it will leave the bottleneck link with enough bandwidth to effectively service other traffic classes.
Policing is similar to traffic shaping but treats excess (nonconforming) traffic more harshly. Rather than allowing a small burst above the defined bandwidth cap like shaping does before dropping, policing drops excess traffic immediately. When facing a policer, impacted traffic ramps up to the bandwidth limit, exceeds, and is dropped. This drop behavior causes TCP to start the ramp-up process over again. The resulting graph looks like a sawtooth. Policing can be used to accomplish other tasks, such as re-marking nonconforming traffic to a lower priority DSCP value, rather than dropping.
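A policer can be sketched with the same token bucket, except nonconforming packets are dropped or re-marked on the spot instead of queued. The bucket depth and the CS1 re-mark value are illustrative choices, not vendor defaults:

```python
# Hypothetical single-rate policer sketch; parameters are illustrative.
RATE_BPS = 512_000
BURST_BITS = 16_000    # policers typically tolerate much smaller bursts

tokens = BURST_BITS

def police(packet_bits, elapsed_s, exceed_action="drop"):
    """Nonconforming traffic is handled immediately: dropped or re-marked.
    Nothing is held back for later the way a shaper would."""
    global tokens
    tokens = min(BURST_BITS, tokens + RATE_BPS * elapsed_s)
    if packet_bits <= tokens:
        tokens -= packet_bits
        return "forward"
    return "drop" if exceed_action == "drop" else "re-mark CS1"

results = [
    police(12_000, 0.0),                          # conforming: forward
    police(12_000, 0.0),                          # bucket exhausted: drop
    police(12_000, 0.0, exceed_action="remark"),  # or re-mark instead
]
print(results)
```

The immediate drops are what produce the sawtooth throughput graph: TCP ramps up, hits the policer, backs off, and starts over.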
Buffering packets to deal with a congested interface seems like a lovely idea. Indeed, buffers are necessary to handle traffic arriving too fast due to bursts or interface speed mismatches—moving from a high-speed LAN to lower-speed WAN, for instance. Thus far, this discussion of QoS has been focused on classifying, prioritizing, and then forwarding packets queued in those buffers in accordance with a policy. Sizing buffers as large as possible might seem like a good thing. In theory, if a buffer is large enough to queue up packets overwhelming a link, all packets will eventually be delivered. However, both large buffers and full buffers introduce problems to be dealt with.
When packets are in a buffer, they are being delayed. Some number of microseconds or even milliseconds are being added to the packet’s journey between source and destination while they sit in a buffer waiting to be delivered. Delayed travel is troublesome for some network conversations, as the algorithms employed by TCP assume a predictable, and ideally low, amount of delay between sender and receiver.
Under the category of active queue management, you will find different methods for managing the contents of the queue. Some methods go after the problem of a full queue, dropping enough packets to leave a little room for new arrivals. Other methods go after the challenge of delay, maintaining shallow queue depths, minimizing the amount of time a packet spends in a buffer. This keeps buffered delay reasonable, allowing TCP to adjust traffic speed to a rate appropriate for the congested interface.
Random early detection (RED) helps us deal with the problem of a full queue. Buffers are not infinite in size; there is only so much memory allocated to each one. When the buffer is filled with packets, then the new arrivals are tail dropped. This does not bode well for critical traffic like VoIP, which cannot be dropped without impacting the user experience. The way to handle this problem is to ensure the buffer is never entirely full. If the buffer is never completely full, then there is always room to accept additional traffic.
To prevent a full buffer, RED uses a scheme of proactively dropping selected inbound traffic, keeping spaces open. The more full the buffer gets, the more likely an incoming packet is to be dropped. RED is the predecessor to modern variations such as Weighted Random Early Detection (WRED). WRED takes into consideration the priority of the incoming traffic based on its mark. Higher priority traffic is less likely to be dropped. Lower priority traffic is more likely to be dropped. If the traffic is using some form of windowed transport, such as TCP, these drops will be interpreted as congestion, signaling the transmitter to slow down.
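The WRED drop decision can be sketched as a per-class drop-probability curve. The thresholds and probabilities below are invented for illustration; on real platforms they are configurable per DSCP value:

```python
import random

# Hypothetical WRED profiles: (min_threshold, max_threshold, max_drop_prob)
# keyed by traffic class. All numbers are illustrative, not defaults.
profiles = {
    "AF41": (32, 40, 0.05),   # higher priority: drops start late, stay rare
    "AF21": (20, 40, 0.10),
    "BE":   (10, 40, 0.20),   # best effort: drops start early
}

def wred_drop(dscp_class, avg_queue_depth, rng=random.random):
    """Decide whether to drop an arriving packet, given the (averaged)
    queue depth. Drop probability ramps from 0 at the min threshold to
    the class maximum at the max threshold; beyond that, tail drop."""
    lo, hi, pmax = profiles[dscp_class]
    if avg_queue_depth < lo:
        return False                       # queue shallow: never drop
    if avg_queue_depth >= hi:
        return True                        # past max threshold: tail drop
    p = pmax * (avg_queue_depth - lo) / (hi - lo)
    return rng() < p                       # random early drop

print(wred_drop("AF41", 25))   # below AF41's min threshold
print(wred_drop("BE", 45))     # past the max threshold
```

The randomness is the point: only some flows lose a packet at any moment, which is what prevents the synchronized back-off described next.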
RED and variations also manage the problem of TCP synchronization. Without RED, all inbound packets are tail dropped in the presence of a full buffer. For TCP traffic, the packet loss resulting from the tail drop causes transmission speed to throttle back and the lost packets to be retransmitted. Once packets are being delivered again, TCP will attempt to ramp back up to a faster rate. If this cycle happens across many different conversations at the same time, as happens in a RED-free tail-drop scenario, the interface can experience bandwidth utilization oscillations where the link goes from congested (and tail dropping) to uncongested and underutilized as all of the throttled-back TCP conversations start to speed back up. When the now-synchronized TCP conversations are talking quickly enough again, the link is again congested, and the cycle repeats.
RED addresses the TCP synchronization issue by leveraging randomness when selecting which packets to drop. Not all TCP conversations will have packets dropped. Only certain conversations will, randomly selected by RED. The TCP conversations flowing through the congested link never end up synchronized, and the oscillation is avoided. Link utilization is more steady.
An obvious question might arise at this point. If packet loss is a bad thing, why not make the buffers big enough to handle congestion? If the buffers are bigger, more packets can be queued up, and maybe you can avoid this pesky problem of packet loss. In fact, this strategy of sizable buffers has found its way into various network devices and some network engineering schemes. However, when link congestion causes buffers to fill and stay filled, the large buffer is said to be bloated. This phenomenon is so well known in the networking industry, it has a name: bufferbloat.
Bufferbloat has a negative connotation because it is an example of too much of a good thing. Buffers are good. Buffers provide a bit of leeway to give a burst of packets somewhere to stay while an egress interface catches up. To handle small bursts of traffic, buffers are necessary, with the critical tradeoff of introducing delay; however, oversizing buffers does not make up for undersizing a link. A link has a specific amount of carrying capacity. If the link is chronically asked to transmit more data than it is able to carry, then it is ill suited to the task required of it. No amount of buffering can overcome a fundamental network capacity issue.
Increasing the depth of a buffer ever larger does not improve link throughput. In fact, a constantly filled buffer puts a congested interface under an even greater strain. Consider a couple of examples, contrasting the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP).
1. In the case of VoIP traffic, buffered packets arrive late. Dead air is enormously disruptive to a real-time voice conversation. VoIP is an example of traffic transported via UDP over IP. UDP traffic is unacknowledged. The sender sends the UDP packets along with no concern about whether they make it to their destination. There is no retransmission of packets if the destination host does not receive a UDP packet. In the case of VoIP, the packet arrives on time, or it does not. If it does not, then there is no point in retransmitting it, because it is far too late to matter. The humans doing the talking have moved on.
LLQ might come to mind as the answer to this problem, but part of the issue is the oversized buffer itself. A large buffer takes time to service, delaying VoIP traffic delivery even when LLQ is servicing the VoIP class. It is better to drop VoIP traffic that has sat in the queue too long than to send it too late.
2. In the case of most application traffic, the traffic is transported via TCP over IP, rather than UDP. TCP is acknowledged. A TCP traffic sender waits for the receiver to acknowledge receipt before more traffic is sent. In a bufferbloat situation, a packet sits in the full, oversized buffer of a congested interface for an overly long time, delaying the delivery of the packet to the receiver. The receiver gets the packet and sends an acknowledgment. The acknowledgment was slow in arriving at the sender, but it did arrive. TCP does not care how long it takes for the packet to arrive, so long as it gets there. And thus, the sender keeps sending traffic at the same speed through the congested interface, which keeps the oversized buffer full and the delay times long.
In extreme cases, the sender might even retransmit the packet, while the original packet is still sitting in the buffer. The congested interface finally sends the original buffered packet to the receiver, with a second copy of the same packet now in flight, putting even more strain on an already congested interface!
These scenarios illustrate that inappropriately sized buffers are, in fact, not good. A buffer must be appropriately sized both for the speed of the interface it services and for the nature of the application traffic likely to pass through it.
One attempt on the part of the networking industry to cope with the oversized buffers found along certain network paths is controlled delay, or CoDel. CoDel assumes an oversized buffer but manages packet delay by monitoring how long a packet has been in the queue. This is known as the sojourn time. When the packet sojourn time has exceeded the computed ideal, the packet is dropped. This means packets at the head of the line—those that have waited the longest—are going to be dropped before packets currently at the tail end of the queue.
CoDel’s aggressive stance toward dropping packets allows TCP flow control mechanisms to work as intended. Rather than packets suffering from high delay while still being delivered, they are dropped before the delay gets too long. The drop forces a TCP sender to retransmit the packet and slow down the transmission, a strongly desirable result for a congested interface. The aggregate result is a more even distribution of bandwidth to traffic flows contending for the interface.
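The sojourn-time test at the heart of CoDel can be sketched as follows, using the commonly cited 5ms target and 100ms interval. This is a simplification: real CoDel also speeds up its drop rate the longer delay stays above target, a refinement omitted here:

```python
# Simplified CoDel-style dequeue check. The 5ms target and 100ms interval
# are the widely cited defaults; everything else is a sketch.
TARGET_S = 0.005      # maximum acceptable sojourn time
INTERVAL_S = 0.100    # grace period before dropping begins

first_above = None    # when sojourn time first exceeded the target

def dequeue_check(sojourn_s, now_s):
    """Return True if the packet at the head of the queue should be
    dropped because queue delay has stayed above target too long."""
    global first_above
    if sojourn_s < TARGET_S:
        first_above = None          # delay acceptable again: reset
        return False
    if first_above is None:
        first_above = now_s         # start the grace interval
        return False
    return now_s - first_above >= INTERVAL_S

checks = [
    dequeue_check(0.002, 0.00),     # under target: keep
    dequeue_check(0.009, 0.05),     # above target, interval starts: keep
    dequeue_check(0.012, 0.20),     # still above after 100ms: drop
]
print(checks)
```

Note that the check runs at the head of the queue, so the packets that have waited longest are the ones dropped, exactly the behavior described above.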
In early implementations, CoDel has been shipping in consumer-edge devices parameterless. Certain defaults about the Internet are assumed. Assumptions include a 100ms or less roundtrip time between senders and receivers, and a 5ms delay is the maximum allowed for a buffered packet. This parameterless configuration makes it easier for vendors of consumer-grade network gear to include. Consumer networks are an important target for CoDel, as the mismatch of high-speed home networks and lower-speed broadband networks causes a natural congestion point. In addition, consumer-grade network gear often suffers from oversized buffers.
Quality of Service is a deep topic; a lot of research has been done in understanding how flows react to specific network conditions, and how network devices should handle queueing and packet processing to ensure the minimal amount of traffic is dropped, and delay and jitter are minimized, under even the worst of network conditions. There are several broad areas of QoS you need to understand in order to be an effective network engineer, including packet classification, packet marking, translation of packet marking across different networks, and queue processing. Each of these interacts with the transport protocols in ways that are not always obvious.
The next chapter will dive into a topic from a completely different realm of network engineering: virtualization. Working through the problem set considered in the early chapters of this book, virtualization plays a role in multiplexing multiple virtual topologies across a single physical topology.
Baker, Fred, David L. Black, Dr. Kathleen M. Nichols, and Steven L. Blake. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. Request for Comments 2474. RFC Editor, 1998. doi:10.17487/RFC2474.
Baker, Fred, and Gorry Fairhurst. IETF Recommendations Regarding Active Queue Management. Request for Comments 7567. RFC Editor, 2015. doi:10.17487/RFC7567.
Bennett, Jon, Shahram Davari, Dimitrios Stiliadis, William Courtney, Kent Benson, Jean-Yves Le Boudec, Victor Firoiu, Dr. Bruce S. Davie, and Anna Charny. An Expedited Forwarding PHB (Per-Hop Behavior). Request for Comments 3246. RFC Editor, 2002. doi:10.17487/RFC3246.
Bollapragada, Vijay, Russ White, and Curtis Murphy. Inside Cisco IOS Software Architecture. Indianapolis, IN: Cisco Press, 2000.
Floyd, S., and V. Jacobson. “Random Early Detection Gateways for Congestion Avoidance.” IEEE/ACM Transactions on Networking 1, no. 4 (August 1993): 397–413. doi:10.1109/90.251892.
Gettys, Jim, and Kathleen Nichols. “Bufferbloat: Dark Buffers in the Internet.” ACM Queue, November 2011. http://queue.acm.org/detail.cfm?id=2071893.
“History Of Networking—Fred Baker—QoS & DS Bit.” Network Collective, August 2, 2017. http://thenetworkcollective.com/2017/08/hon-fred-baker-qos/.
Nichols, Kathleen. “Controlling Queue Delay.” ACM Queue, May 2012. http://queue.acm.org/detail.cfm?id=2209336.
Postel, J., ed. Internet Protocol. Request for Comments 791. RFC Editor, 1981. doi:10.17487/rfc791.
“RED in a Different Light.” Jg’s Ramblings, December 17, 2010. https://gettys.wordpress.com/2010/12/17/red-in-a-different-light/.
Srikant, Rayadurgam. The Mathematics of Internet Congestion Control. 2004 edition. Boston, MA: Birkhäuser, 2003.
Stringfield, Nakia, Russ White, and Stacia McKee. Cisco Express Forwarding. 1st edition. Indianapolis, IN: Cisco Press, 2007.
Weiss, Walter, Dr. Juha Heinanen, Fred Baker, and John T. Wroclawski. Assured Forwarding PHB Group. Request for Comments 2597. RFC Editor, 1999. doi:10.17487/rfc2597.
1. QoS is sometimes deployed to counter the impact of running a File Transfer Protocol, such as FTP or a backup program, and a real-time streaming application, such as voice over IP, over the same link. Why do these two kinds of application interact poorly in a single queue? A hint: packet sizes matter.
2. The chapter notes that TCP sends traffic until it encounters congestion and then backs off. What mechanism in TCP causes this effect? What happens if a large number of TCP sessions with packets in a single queue all have a single packet dropped at the same time?
3. How does WRED try to mitigate the effect of dropping packets across a set of TCP flows at the same time?
4. Trace the way in which the ToS bits in an IPv6 header are translated into an MPLS header and then from an MPLS header to an Ethernet header. In what places is information lost in these translations?
5. Some vendors have recommended the same DSCP values be used in different parts of the network to express different classes or types of service. Would you agree with this recommendation? What complexities does it add, and where does it make things simpler?
6. What kinds of traffic might you place into a high-priority class, and why? What kinds in a scavenger class, and why?
7. According to the State/Optimization/Surface three-way tradeoff, adding state should increase optimization while also increasing complexity, etc. Consider the case of adding more classes of service in a network. Describe the tradeoffs between additional state, increased optimization, and where the interaction surfaces between the different layers of protocols in the network might be impacted.
8. Traffic engineering is a completely different way to implement Quality of Service in a network. Can you use traffic engineering to resolve all Quality of Service problems in all networks? Describe a network engineering situation or topology in which it seems like traffic engineering would be able to solve most QoS requirements and one where it would not.
9. What percentage of traffic is generally recommended to be placed in the low-latency queue in an LLQ system? Explain why.
10. How does SD-WAN take the complexity of managing QoS Class and Type of Service “out of the hands of humans”? What are the advantages and disadvantages of such an approach?
Network virtualization is, in the simplest terms possible, the creation of logical topologies built on top of a physical topology. These logical topologies are often called virtual topologies—hence the concept of network virtualization. These topologies may consist of a single virtual link across a larger network, called a tunnel, or a collection of virtual links that appear to be a complete network on top of the physical network, called an overlay.
This chapter will begin with a discussion about why virtual topologies are created and used, illustrated by two use cases. The second section of this chapter will consider the problems any virtualization solution must solve, and the third section will consider complexity and network virtualization. Following this, two examples of virtualization technologies will be considered: segment routing (SR) and Software-Defined Wide Area Networks (SD-WAN).
Virtualization adds complexity in protocol design, network design, and troubleshooting, so why virtualize? The reasons tend to reduce to separating multiple traffic flows across a single physical network. This might sound suspiciously like another form of multiplexing because it is another form of multiplexing. The primary differences between the forms of multiplexing considered to this point and virtualization are
• Allowing multiple control planes to operate with different sets of reachability information across a single physical topology
• Allowing multiple sets of reachable destinations to operate across a single physical topology without interacting with one another
The multiplexing techniques considered to this point have focused on allowing multiple devices to use a single physical network (or set of wires), allowing every device to talk to every other device (so long as they know about one another from a reachability perspective). Virtualization focuses on breaking up the single physical network into multiple reachability domains, where every device within a reachability domain can communicate with every other device within the same reachability domain, but devices cannot communicate across reachability domains (unless there is some connection point between the reachability domains).
Figure 9-1 illustrates a network with a virtual topology laid on top of the physical topology.
In Figure 9-1, a virtual topology has been created on top of the physical network, with the virtual link [C,H] created to carry traffic across the network. In order to create the virtual topology, C and H must have some sort of local forwarding information separating the physical topology from the virtual topology, which would normally pass through either E or D. This would normally take the form of either a special set of virtual interface entries in the local routing table, or a Virtual Routing and Forwarding (VRF) table containing only information about the virtual topology.
Considering the packet flow through the virtual topology can be helpful in understanding the concepts. What would the packet flow look like if C and H had virtual interfaces? Figure 9-2 illustrates.
In Figure 9-2, the forwarding process follows these steps:
1. A transmits a packet toward M.
2. C receives this packet, and, examining its local routing table, finds the shortest path to the destination is through a virtual interface toward H. This virtual interface is normally called a tunnel interface; it appears, from the routing table’s perspective, like any other interface on the router.
3. The virtual interface through which the packet needs to be transmitted has rewrite instructions that include adding a new header, the tunnel header, or outer header, onto the packet, and forwarding the resulting packet. The original packet header is now called the inner header. C adds the outer header and processes the new packet for forwarding.
4. C now examines the new destination, which is H (remember the original destination was M). H is not directly connected, so C needs to look up how to reach H. This is called a recursive lookup, as C is looking for the path to an intermediate destination to take the packet toward, but not to, the final destination.
5. C will now place the correct information onto the packet, in a link local header, to forward the traffic to E.
6. When E receives this packet, it will strip the outer forwarding information, the link local header, and forward the traffic based on the first header C placed on the packet, during the initial lookup. This outer header tells E to forward the packet to H; E does not see or switch on the original inner header placed on the packet by A.
7. E will add a new link local header so the packet will be correctly forwarded to H, and transmit the packet on the correct interface.
8. When H receives the packet, it will strip the link local header and discover the outer header. The outer header says the packet is destined for H itself, so H will strip this header, and discover the original packet header or the inner header.
9. H will now look up M in its local routing table and discover M is locally connected. H will place the correct link local header on the packet and transmit it through the correct interface so the packet reaches M.
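Steps 2 through 5 above amount to a recursive lookup plus encapsulation, which can be sketched as follows. The addresses, interface names, and table entries are invented to mirror Figure 9-2, not taken from any real device:

```python
# Hypothetical forwarding tables at router C, mirroring Figure 9-2:
# the best path to M is the tunnel interface; reaching the tunnel
# endpoint H requires a second (recursive) lookup via E.
routing_table = {
    "M": ("tunnel0", None),   # destination reached via the virtual link
    "H": ("eth0", "E"),       # tunnel endpoint H reached via neighbor E
}
tunnel_endpoints = {"tunnel0": "H"}

def forward(dest):
    """Return the header stack C would build, outermost first."""
    iface, next_hop = routing_table[dest]
    headers = [f"inner: dst={dest}"]
    if iface in tunnel_endpoints:                  # step 3: encapsulate
        endpoint = tunnel_endpoints[iface]
        headers.insert(0, f"outer: dst={endpoint}")
        iface, next_hop = routing_table[endpoint]  # step 4: recursive lookup
    headers.insert(0, f"link-local: to={next_hop} via {iface}")  # step 5
    return headers

print(forward("M"))
```

Midpath devices such as E forward on the outer header only; the inner header is not examined again until the tunnel endpoint H strips the outer one.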
If C and H are using VRFs rather than tunnel interfaces, the process in the preceding list changes at steps 2 and 8. At step 2, C will look up M as a destination in the VRF associated with the [A,C] link. When C finds that traffic toward M should be forwarded through a virtual topology via H, it will place an outer header on the packet and process the packet again, based on this outer header, through the base VRF, or rather the routing table representing the physical topology. When H receives the packet, it will strip off the outer header and process the packet again using the VRF to which M is connected to look up the information needed to forward the traffic to its final destination. The tunnel interface, in this case, is replaced with a separate forwarding table; rather than processing the packet through the same table twice using two different destinations, the packet is processed through two different forwarding tables.
The term tunnel has many different definitions; for this book, a tunnel will be used to describe a virtual link where an outer header is used to encapsulate an inner header, and
• The inner header is at the same layer, or a lower layer, than the outer header (for instance, an Ethernet header carried inside an IPv6 header; normally IPv6 is carried inside Ethernet).
• At least some network devices in the path, whether virtual or physical, forward the packet based on the outer header alone.
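A small sketch of this definition, assuming toy layer numbers; it checks the two conditions above (inner header at the same or lower layer, transit forwarding on the outer header alone).

```python
# Toy layer numbers for the two conditions in the tunnel definition above.
LAYER = {"ethernet": 2, "ipv6": 3}

def is_tunnel(outer_proto, inner_proto):
    """Condition 1: the inner header is at the same layer or lower than the outer."""
    return LAYER[inner_proto] <= LAYER[outer_proto]

def transit_forward(packet):
    """Condition 2: a transit device forwards on the outer header alone."""
    return packet["outer"]["dst"]

packet = {"outer": {"proto": "ipv6", "dst": "H"},
          "inner": {"proto": "ethernet", "dst": "M"}}

assert is_tunnel("ipv6", "ethernet")      # Ethernet carried inside IPv6
assert not is_tunnel("ethernet", "ipv6")  # the normal layering is not a tunnel
assert transit_forward(packet) == "H"     # the inner destination is ignored
```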
Moving from virtual interfaces to VRFs is conceptually different enough to engender different descriptive terms. The underlay is the physical (or potentially logical!) topology through which traffic is tunneled. The overlay is the set of tunnels making up the virtual topology. Most of the time, the terms underlay and overlay are not used with single tunnels, or in the case of a service running over the public Internet. A service that builds a virtual topology across the public Internet is often called an over-the-top service.
Again, these terms are used somewhat interchangeably, and even in a very sloppy way, in the larger network engineering world. With this background, it is time to turn to use cases, in order to inform the problem set virtualization solutions need to solve.
Although applications should not be built with Ethernet connectivity as an underlying assumption, many are. For instance:
• Some storage and database vendors build their devices with the assumption that Ethernet connectivity means short distance and short delay, or they design systems around proprietary transport protocols that run directly on top of Ethernet frames, rather than on top of Internet Protocol (IP) packets.
• Some virtualization products embed assumptions about connectivity into their operation, such as the reliability of the Ethernet to IP address cache for the default gateway and other reachable destinations.
These kinds of applications require what appears to be an Ethernet link between the devices (whether physical or virtual) running different nodes or copies of the application. Beyond this, some network operators believe running a large flat Ethernet domain is simpler than running a large-scale IP domain, so they would prefer to build the largest Ethernet domains they can (“switch where you can, route where you must” was a common saying in the days when switching was performed in hardware, while routing was performed in software, so switching packets was much faster than routing them). Some campuses are also built with the underlying idea of never asking a device to change its IP address once it is connected. Because users may be connected to different Ethernet segments based on their security domain, each Ethernet segment must be available at every wireless access point, and often at every Ethernet port, in the campus.
Given a network based on IP, which anticipates Ethernet as one of the many transports on top of which IP will run, how can you provide Ethernet connectivity to devices interconnected over an IP network? Figure 9-3 illustrates the problems to be solved.
In Figure 9-3, a process running on A, with the IP address 2001:db8:3e8:100::1 needs to be able to communicate with a service running on B with the IP address 2001:db8:3e8:100::2 as if they are on the same Ethernet segment (the two services need to see one another in neighbor discovery, etc.). To make the problem more complex, the service at A also needs to be able to move to K without changing its local neighbor discovery cache or default router. The network itself, which is shown as a small section of a spine and leaf fabric, is a routed network running IPv6.
What would be required to allow the requirements to be met?
There must be a way to carry Ethernet frames over the IP network separating the servers. This would normally be some form of tunneling encapsulation, as described at the beginning of this section. Tunneling would allow Ethernet frames to be received at C, for instance, encapsulated in some sort of outer header so they can be transported across the routed network. When the packet containing the Ethernet frame reaches D, this outer header can be stripped off and the Ethernet frame forwarded locally. From the perspective of D, the frame is locally originated.
There must be a way to learn about the destinations reachable via the tunnel and draw traffic into the tunnel. These are actually two separate, but related, problems. Drawing traffic into the tunnel might involve running a second control plane with its own VRFs, or adding additional information into an existing control plane about the Ethernet Media Access Control (MAC) addresses reachable at each edge router.
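One way to picture the second problem is a table mapping each Ethernet MAC address to the edge router behind which it lives; the sketch below assumes invented router names and documentation-range MAC addresses.

```python
# Control plane sketch: edge routers advertise the MAC addresses reachable
# behind them; the ingress edge uses this table to pick a tunnel endpoint.
# Router names and MAC values are invented for illustration.

mac_to_edge = {}

def advertise(edge_router, macs):
    """Each edge router reports the MACs locally reachable behind it."""
    for mac in macs:
        mac_to_edge[mac] = edge_router

def headend_lookup(dst_mac):
    """At the ingress edge, find the tunnel endpoint for a frame."""
    return mac_to_edge.get(dst_mac)      # None: unknown; flood or drop by policy

advertise("D", ["00:00:5e:00:53:0b"])    # B's MAC is behind edge router D
advertise("C", ["00:00:5e:00:53:0a"])    # A's MAC is behind edge router C

assert headend_lookup("00:00:5e:00:53:0b") == "D"
assert headend_lookup("00:00:5e:00:53:ff") is None
```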
There may be a requirement to transfer Quality of Service (QoS) markings from the inner header to the outer header, so traffic is handled correctly when it is forwarded. See Chapter 8, “Quality of Service,” for more information on carrying QoS markings between the two headers in a tunnel.
Almost every organization has remote workers of some sort, either full time, or just people who travel, and most organizations have remote offices of some kind, where a small group of people work away from the main office to interact with a local community in some way, such as retail or sales. All of these people still need access to network resources, such as email, travel systems, files, etc. These services cannot be exposed to the public Internet, of course, so some other access mechanism must be provided. Figure 9-4 illustrates the problem space.
There are two primary concerns in this use case:
• How can the traffic between the individual host—B—and the three hosts in the small office—C, D, and E—be protected from being intercepted and read by an attacker? How can the destination addresses themselves be protected from exposure into the public network? These problems involve some sort of security, which, in turn, implies some form of packet encapsulation.
• How can the quality of the user’s experience in these remote locations be managed to support voice over IP and other real-time applications? Because providers on the Internet do not support quality of service, some other form of quality assurance must be provided.
The problem set to solve here, then, includes two more general issues.
• There must be a way to encapsulate the traffic being carried across the public network without exposing the original header information and without exposing the information carried in the packet to inspection. The easiest solution for these problems is to tunnel (often in an encrypted tunnel) the traffic from A and F to the edge router in the organization’s network, G, where the encapsulation can be removed and the packets forwarded to A.
• There must be a way to advertise the reachable destinations from G toward the remote users, and the existence of (or reachability of) the remote users to G, and the network behind G. This reachability information must be used to draw traffic into the tunnels. The control plane, in this case, may need to redirect traffic among the various entry and exit points to the public network, and try to control the path of the traffic through the network, in order to ensure the remote users receive a good quality of experience.
The two use cases in the preceding sections expose the two questions every network virtualization solution must solve:
How is traffic encapsulated within the tunnel so the packets and control plane information can be separated from the underlying network?
The solution for this problem is generally some form of encapsulation into which the original packet is placed as it is carried through the network. The primary consideration for the encapsulation is hardware switching support in the underlay network, to allow the efficient forwarding of encapsulated packets. A secondary consideration is the size of the encapsulating packet format; each octet of additional encapsulation header reduces the amount of payload the tunnel can carry (unless there is a differential between the Maximum Transmission Unit, or MTU, in the network designed to account for the additional header information tunneling imposes).
Note
Path MTU Discovery (PMTUD) often does a poor job of detecting the MTU of encapsulated packets; because of this, manual tuning of the MTU at the point where the tunnel header is imposed is often required.
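The arithmetic behind this note is simple; the 40-octet figure below assumes, as an example, an outer IPv6 header with no extension headers.

```python
# Each octet of tunnel header reduces the payload the tunnel can carry,
# unless the underlay MTU is raised to absorb the overhead.

def tunnel_payload_mtu(link_mtu, outer_header_octets):
    """Largest inner packet that fits without fragmentation."""
    return link_mtu - outer_header_octets

assert tunnel_payload_mtu(1500, 40) == 1460   # standard Ethernet underlay
assert tunnel_payload_mtu(1540, 40) == 1500   # underlay tuned for the tunnel
```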
How are the destinations reachable through the tunnel advertised through the network?
In more general tunneled solutions, the tunnel becomes “just another link” in the overall network topology. The destinations reachable through the tunnel, and the additional virtual link, are simply included as a part of the control plane, like any other destinations and links. In these solutions, there is one routing or forwarding table in each device, and a recursive lookup is used to process the packet through forwarding at the point where traffic enters the tunnel, or the tunnel headend. Traffic is drawn into the tunnel by modifying the metrics so the tunnel is a more desirable path through the network for those destinations the network operator would like to be reached through the tunnel. This generally means largely manual solutions to the problem of drawing traffic into the tunnel, such as setting the tunnel metric lower than the path over which the tunnel runs, and then filtering the destinations advertised through the tunnel to prevent the advertisement of destinations that should be unreachable through the tunnel. In fact, if the destinations reachable through the tunnel include the tunnel termination point (the tunnel tailend), a permanent routing loop can form, or the tunnel will cycle between forwarding traffic correctly and not forwarding traffic at all.
In overlay and over-the-top solutions, a separate control plane is deployed (or a separate database of reachability information is carried for the destinations reachable in the underlay and overlay in a single control plane). Destinations reachable through the underlay and overlay are placed into separate routing tables (VRFs) at the tunnel headend, and the table used to forward traffic is based on some form of classification system. For instance, all the packets received on a particular interface may be placed into an overlay tunnel automatically, or all the packets with a specific class of service set in their packet headers, or all traffic destined to a specific set of destinations. Full overlay and over-the-top virtualization mechanisms do not generally rely on metrics to draw traffic into the tunnel at the headend.
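The classification step might look like the following sketch, where the rules and VRF names are assumptions for illustration:

```python
# Classification at the tunnel headend: the forwarding table (VRF) used for
# a packet is chosen by interface, class of service, or destination.

def classify(packet, rules):
    """Return the name of the VRF whose rule first matches the packet."""
    for vrf_name, match in rules:
        if match(packet):
            return vrf_name
    return "base"    # unmatched traffic uses the underlay table

rules = [
    ("overlay-a", lambda p: p["in_interface"] == "eth1"),   # by interface
    ("overlay-b", lambda p: p.get("cos") == 5),             # by class of service
]

assert classify({"in_interface": "eth1"}, rules) == "overlay-a"
assert classify({"in_interface": "eth2", "cos": 5}, rules) == "overlay-b"
assert classify({"in_interface": "eth2"}, rules) == "base"
```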
One other optional requirement is to provide for quality of service, either by copying the QoS information from the inner header to the outer header, or by using some form of traffic engineering to carry traffic along the best available path.
Segment routing (SR) may, or may not, be considered a tunneled solution, based on the specific implementation, and how strongly you want to adhere to the definition of tunnels presented in the “Understanding Virtual Networks” section earlier in this chapter. This section will consider the basic concept of segment routing and two possible implementation schemes—one using IPv6 flow labels and one using Multi-protocol Label Switching (MPLS) labels.
Each device in an SR-enabled network is given a unique label. A label stack describing the path in terms of these unique labels can be attached to any packet, causing it to take the specific path indicated. Figure 9-5 illustrates.
Each router in Figure 9-5 advertises an IP address as an identifier along with a label attached to this IP address. In SR, the label attached to the router identifier is called a node segment identifier (node SID). As each router in the network is assigned a unique label, a path can be described through the network using just these labels. For instance:
• If you wanted to forward traffic from A to K along the path [B,E,F,H], you could describe this path using the labels [101,104,105,107].
• If you wanted to forward traffic from A to K along the path [B,D,G,H], you could describe this path using the labels [101,103,106,107].
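The two examples above can be expressed directly, using the node SIDs from Figure 9-5:

```python
# Node SIDs advertised by the routers in Figure 9-5, as used in the two
# path examples above.
node_sid = {"B": 101, "D": 103, "E": 104, "F": 105, "G": 106, "H": 107}

def label_stack(path):
    """Describe a router-level path as a stack of node SIDs."""
    return [node_sid[router] for router in path]

assert label_stack(["B", "E", "F", "H"]) == [101, 104, 105, 107]
assert label_stack(["B", "D", "G", "H"]) == [101, 103, 106, 107]
```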
The set of labels used to describe a path is called the label stack. There are two links between D and H; how can this be described? There are several options available in SR, including:
• The label stack may include just the node SIDs describing the path through the network in terms of the routers, as previously shown. In this case, if the label stack included the pair [103,107], D would simply forward to H normally, based on local routing information, so it would use whatever local process it would use in forwarding any other packet, such as load sharing across the two links, to forward the SR-labeled traffic, as well.
• The label stack could include an explicit label to load share over any available set of paths available at this point in the network.
• H could assign a label per inbound interface, as well as a node SID tied to its local router identifier. These labels would be advertised just like the node SID, but as they describe an adjacency, they are called an adjacency SID. The adjacency SID is locally unique; it is unique to the router advertising the adjacency SID itself.
A third kind of SID, the prefix SID, describes a specific reachable destination (a prefix) within the network. A node SID can be implemented as a prefix SID tied to a loopback address on each router in the network.
The entire path does not need to be described by the label stack. For instance, the label stack [101,103] would direct traffic to B, then to D, but would then allow D to use any available path to reach the destination IP address at K. The label stack [105] would ensure traffic passing through the network toward K would pass through F; it does not matter how the traffic reached that point in the network, nor how it was forwarded after it reaches F, so long as it passes through F while being forwarded toward K.
Each label in the stack represents a segment; packets are carried from label to label across each segment in the network to be transported from the headend of the path to the tailend of the path.
MPLS was invented as a way to blend the advantages of Asynchronous Transfer Mode (ATM), which is no longer widely deployed, with IP switching. In the earlier days of network engineering, the chipsets used for switching packets were more constrained in their capabilities than they are now; many of the chipsets being used were Field Programmable Gate Arrays (FPGAs) rather than Application-Specific Integrated Circuits (ASICs), so the length of the field on which the packet was switched was directly correlated to the speed at which the packet could be switched. It was often easier to recycle a packet, or to process it twice, than it was to include a lot of complex information in the header so the packet could be processed once.
Note
Packet recycling is still often used in many chipsets to support inner and outer headers, or even to process different parts of a longer, more complex, packet header.
MPLS encapsulates the original packet into an MPLS header, which is then used to switch the packet through the network. Figure 9-6 shows the MPLS header.
The entire header is 32 bits; the label is 20 bits. Three operations can be carried out by an MPLS forwarding device:
• The current label in the MPLS header can be swapped with another label (SWAP).
• A new label can be pushed onto the packet (PUSH).
• The current label can be popped, and the label under the current label processed (POP).
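These three operations can be modeled on a Python list, with the top (outermost) label first; this is a conceptual sketch, not the bit-level encoding:

```python
# The three MPLS forwarding operations, modeled on a list whose first
# element is the top (outermost) label.

def push(stack, label):
    """PUSH: impose a new label; it becomes the top label."""
    return [label] + stack

def pop(stack):
    """POP: remove the top label, exposing the label beneath it."""
    return stack[1:]

def swap(stack, label):
    """SWAP: replace the top label with another label."""
    return [label] + stack[1:]

stack = push([], 100)
stack = push(stack, 200)
assert stack == [200, 100]
assert swap(stack, 201) == [201, 100]
assert pop(stack) == [100]
```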
The PUSH and POP operations are carried directly into SR; the SWAP operation is implemented in SR as a CONTINUE, which means the current label is swapped with the same label (i.e., a header with the label 100 will be replaced with a label of 100), and the processing of this current segment will continue. The easiest way to understand the processing is through an example; Figure 9-7 illustrates.
In Figure 9-7, each router has a globally unique label assigned from the Segment Routing Global Block (SRGB); these are advertised through a routing protocol or some other control plane. When A receives a packet destined for N, it will choose a path through the network using some local mechanism. At this point:
• To begin the process, A will PUSH a series of MPLS headers on the packet that describe the path through the network, [101,103,104,202,105,106,109,110]. When A switches the packet toward B, it will POP the first label in the stack, as there is no need to send B its own label in a header. The label stack on the [A,B] link will be [103,104,202,105,106,109,110].
• When B receives the packet, it examines the next label on the stack. Finding the label to be 103, it will POP this label and forward the packet to D. The SR label stack, in this case, has picked out one of two possible equal cost paths through the network, so this is an example of SR choosing a specific path. The label stack on the [B,D] link will be [104,202,105,106,109,110].
• When D receives the packet, the top label on the stack will be 104; D will POP this label and send the packet to E. The label stack on the [D,E] link will be [202,105,106,109,110].
• When E receives this packet, the top label on the stack is 202. This is an adjacency selector, so it selects for a specific interface rather than a specific neighbor. E will select the correct interface, the lower of the two interfaces in the illustration, and POP this label. The top label is now the node SID for F, which can be removed, since the packet is being transmitted to F; E will recycle the packet and POP this label as well. The label stack on the [E,F] link will be [106,109,110].
• When the packet reaches F, the next label in the stack is 106. This label indicates the packet should be transmitted to G. F will POP the label and transmit it to G. The label stack on the [F,G] link will be [109,110].
• When the packet reaches G, the next label on the stack is 109, which indicates the packet should be forwarded toward L. As G is not directly connected to L, it can use a local, loop-free (generally the shortest) path toward L. In this case, there are two equal cost paths toward L, so G will POP the 109 label and forward over one of these two paths toward L. On the [G,L] segment, the label stack is [110].
• Assume G chooses to send the packet via K. When K receives the packet, it will have a label stack containing [110], which is not the local label, nor is it an adjacent node. In this case, the label needs to remain the same, or the segment needs to CONTINUE. To implement this, K will SWAP the current label, 110, for another copy of the same label, so K will forward the traffic with the same label. On the [K,L] link, the label stack will be [110].
• When L receives the packet, the only remaining label will be 110, which indicates the packet should be forwarded to M. L will POP the 110 label, effectively removing all the MPLS encapsulation, and forward the packet to M.
• When M receives the packet, it will forward the packet using normal IP to N, the final destination.
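The entire walkthrough can be replayed as a small simulation; the POP and CONTINUE steps below mirror the label stacks quoted at each hop, using the SID values from Figure 9-7.

```python
# Replay of the hop-by-hop example above. At each hop the top label is
# POPped once it has directed the packet to an adjacent node; at K, where
# 110 is neither local nor adjacent, the segment CONTINUEs (the label is
# swapped for an identical one).

def pop(stack):
    return stack[1:]

def continue_segment(stack):
    # CONTINUE: SWAP the top label for the same label
    return [stack[0]] + stack[1:]

# A imposes the path, POPping B's own label before transmitting:
stack = [103, 104, 202, 105, 106, 109, 110]

stack = pop(stack)               # B: 103 directs the packet to D
stack = pop(stack)               # D: 104 directs the packet to E
stack = pop(stack)               # E: 202 is an adjacency SID, selects the link
stack = pop(stack)               # E (recycled): 105 is F's node SID
assert stack == [106, 109, 110]  # stack on the [E,F] link
stack = pop(stack)               # F: 106 directs the packet to G
stack = pop(stack)               # G: 109 directs the packet toward L, via K
stack = continue_segment(stack)  # K: 110 is neither local nor adjacent
assert stack == [110]            # stack on the [K,L] link
stack = pop(stack)               # L: 110 directs the packet to M
assert stack == []               # all MPLS encapsulation removed; M routes by IP
```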
The stack of labels concept in MPLS is implemented as a series of MPLS headers stacked on top of one another. Popping the label means to remove the topmost label, pushing a label means adding a new MPLS header onto the packet, and continuing means swapping the label with an identical label. When you are working with a stack of labels, the concepts of inner and outer are often confusing, particularly as many people use the idea of a label and a header interchangeably. Perhaps the best way to reduce confusion is to use the term header to refer to the entire label stack and the original header being carried inside MPLS, while referring to the labels as individual labels in the stack. The inner header would then be the original packet header, while the outer header would be the stack of labels; the inner label would be the next label on the stack at any point in the packet’s travels through the network, while the outer label would be the label on which the packet is actually being switched.
Although the example given here uses IP packets inside MPLS, the MPLS protocol is designed to carry just about any protocol, including Ethernet. SR MPLS is not, therefore, limited to being used to carry a single type of traffic, but can also be used to carry Ethernet frames over an IP/MPLS-based network. This means SR can be used to support the first use case discussed in this chapter, providing Ethernet services over an IP network.
The operation of SR on MPLS and SR on IPv6 is similar in all respects except how the label stack is carried and processed. In IPv6, SR information is carried in a routing extension header following the base IPv6 header, shown in Figure 9-8.
In the IPv6 SR implementation, the SR label stack is carried in the routing header of the IPv6 packet. The information in this header is designed specifically to provide information about the nodes through which “this packet” should pass when being routed through the network, so it serves the same purpose as the SR label stack. In IPv6 implementations of SR, each label is 128 bits, so a local IPv6 address can be used as a SID.
The one interesting point is the IPv6 specifications indicate the IPv6 header must not be changed by a router when processing the packet (see RFC8200 for further details). Instead of popping, pushing, and swapping labels, then, SR IPv6 relies on each node along the path having a pointer to the current label in the stack being processed.
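A simplified model of this pointer-based processing, loosely following the Segments Left field of the IPv6 routing header (segment list stored in reverse order, with addresses invented for illustration):

```python
# SR over IPv6: the segment list is fixed at the headend, and each segment
# endpoint only moves a pointer (Segments Left) rather than popping labels.
# This is a simplified conceptual model, not the full header processing.

def process_segment(header):
    """Advance to the next segment without modifying the segment list."""
    if header["segments_left"] == 0:
        return header                          # final segment reached
    header = dict(header)
    header["segments_left"] -= 1               # move the pointer only
    header["active"] = header["segments"][header["segments_left"]]
    return header

# Segment list as written by the headend, stored in reverse, with
# Segments Left pointing at the active segment:
hdr = {"segments": ["2001:db8::f", "2001:db8::e", "2001:db8::b"],
       "segments_left": 2, "active": "2001:db8::b"}

hdr = process_segment(hdr)
assert hdr["active"] == "2001:db8::e"          # pointer moved, list unchanged
hdr = process_segment(hdr)
assert hdr["active"] == "2001:db8::f" and hdr["segments_left"] == 0
```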
SR is technically a source routing mechanism, because the source chooses the path through the network—although the source routing in SR can be much looser than traditional source routing. For each label on the stack, there are two possible ways a node along the path can process the packet:
• The label provides explicit instructions about how the packet should be handled at this device; POP or CONTINUE the segment (label) and process the packet accordingly.
• The label does not provide explicit instructions about how the packet should be handled at this device; use local routing information to forward the packet and CONTINUE the segment.
In neither case does the processing node need to know about the entire path to switch the packet; it either simply follows the label path as specified, or it processes the packet based on purely local information. Because of this paradigm, signaling SR is simple. Two types of signaling need to occur.
The local node, prefix, and adjacency SIDs assigned to a node in the network need to be advertised by each node in the network. This signaling is primarily carried in routing protocols; for instance, the Intermediate System to Intermediate System (IS-IS) protocol is extended by the draft IS-IS Extensions for Segment Routing1 to carry prefix SIDs using a sub Type Length Value (sub-TLV), as shown in Figure 9-9.
Extensions to other routing and control plane protocols have been proposed for standardization as well; see the “Further Reading” section at the end of the chapter for a list of these extension proposals. Because path calculation in SR is source based, there is no need to carry a path in a distributed routing protocol. The only real need is to provide each node in the network with the SR node, prefix, and adjacency information.
In the case where SR paths are calculated by a centralized device or controller, there needs to be a way to advertise a label path to use in order to reach a particular destination. Extensions have been proposed to the Border Gateway Protocol (BGP) in Advertising Segment Routing Policies in BGP,2 and in the Path Computation Element Protocol (PCEP) in PCEP Extensions for Segment Routing.3 These two kinds of advertisements are separate from one another, as the only node in the network that needs to either calculate or impose the segment list is the tunnel headend or the point where traffic enters the segment path.
Many organizations need to provision and support large numbers of remote offices. For instance:
• Retail chains may have hundreds or even thousands of stores and locations worldwide.
• A regional bank may have hundreds of branch offices and thousands of cash machine locations.
When fixed-location private line services were all that service providers offered at any scale, these kinds of problems were solved using large-scale hub-and-spoke networks. Figure 9-10 illustrates a hub-and-spoke network.
The network shown in Figure 9-10 is actually rather small; the three dots in the center of the remote sites may represent hundreds or thousands of additional sites. In many implementations (especially older ones), the links between the two hub routers, A and B, and the remotes, such as C and N, are point-to-point links. This means the hub router must have an interface configured for each remote router, routing filters, packet filters, and any Quality of Service configurations. Not only is this a major problem from a configuration perspective, but it is also difficult to maintain thousands of individual neighbors in terms of processor and memory utilization.
To reduce the amount of processing power required in maintaining such a network, protocols were modified to prevent treating the remote sites as if they were part of the tree. Instead, these modifications allowed these remote sites to be treated as if they were leaves, or stub networks. Another step toward making these kinds of networks easier to create and manage was using a point-to-multipoint interface (with the appropriate underlying technology, such as Frame Relay) at the hub routers. When the connections to the remote sites are configured as point-to-multipoint, the hub routers, A and B, treat all the spokes as if they are on a single broadcast segment (like an Ethernet segment, in effect). Each spoke router, however, still treats its connection to the hub routers as a point-to-point link. Even with these modifications, building and maintaining such large networks is still very difficult. Links to each remote site must be purchased and managed, remote equipment must be configured and managed, the configuration of the hub routers must be maintained, and so on.
Software-Defined Wide Area Network (SD-WAN) solutions were originally developed to solve this specific problem set. Originating in Cisco’s Dynamic Multi-point Virtual Private Network (DMVPN), the idea behind the DMVPN was to use a tunneled overlay, or over-the-top, network running on top of the public Internet. This allowed the remote sites to use locally available Internet connectivity, rather than purchasing a circuit per site, and reduced configuration and maintenance time through autoconfiguration and other tools.
SD-WAN takes the concept of an over-the-top network one step further. An SD-WAN solution is normally built using several components:
• A specialized appliance or virtualized service to replace the routers normally placed at the hub and spoke locations
• A modified version of a standard routing protocol to provide reachability (and potentially one measure of circuit liveness) and to pass policies through the network
• An implementation of either IP Security (IPsec) or Transport Layer Security (TLS) to provide secure tunneled transport between the hub-and-spoke devices
• A controller to monitor the state of each virtual link, the applications using the link, and the amount of goodput versus the amount of traffic, and to make dynamic adjustments to traffic flow and QoS settings to optimize application operation across the over-the-top virtual network
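The last component's job can be illustrated with a small sketch; the link names, byte counts, and the "highest goodput ratio wins" policy are all invented for illustration, not taken from any actual SD-WAN product.

```python
# Illustrative sketch of a controller comparing goodput (application-useful
# bytes delivered) to total traffic per virtual link, then steering traffic
# onto the link with the best ratio. All names and numbers are made up.

def goodput_ratio(stats: dict) -> float:
    """Fraction of carried bytes that were useful to the application."""
    return stats["goodput_bytes"] / stats["total_bytes"]

def pick_link(links: dict) -> str:
    """Choose the virtual link with the highest goodput ratio."""
    return max(links, key=lambda name: goodput_ratio(links[name]))

links = {
    "tunnel_via_isp1": {"goodput_bytes": 880, "total_bytes": 1000},
    "tunnel_via_isp2": {"goodput_bytes": 990, "total_bytes": 1000},
}
print(pick_link(links))  # tunnel_via_isp2
```

A real controller would of course also weigh latency, jitter, loss, and per-application QoS policy; this sketch isolates only the goodput-versus-traffic comparison named in the component list.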
There are many different ways in which SD-WANs can be implemented; for instance:
• The SD-WAN can replace the “last mile”; rather than installing a circuit to each remote site, you can use SD-WAN solutions to reach an exchange or colocation point, and then carry the traffic through a more traditional service through a provider back to the hub routers (this is a form of backhaul).
• The SD-WAN can replace the entire path from the organization’s network to the remote sites.
• The SD-WAN can be used to draw traffic into a cloud service, where some preliminary processing might take place, or some applications might be deployed, with just traffic that must be carried into the organization’s network carried the rest of the way into the hub routers.
There are tradeoffs with SD-WAN and other over-the-top solutions, as there are with any other networking technology. For instance, pushing corporate remote site traffic over a “plain” public Internet connection (or pair of services, or some other Ethernet-terminated service) may be “good enough” in some situations, but providers tend to treat traffic in higher-priced services better (naturally enough), particularly in outages.
Virtualization is often undertaken to find a simpler way to solve some of the problems noted in the initial sections of this chapter, such as traffic separation. There are, as with all things in the network engineering world, tradeoffs. In fact, if you have not found the tradeoff, you have not looked hard enough. This section will consider some (though certainly not all) of the various complexity tradeoffs in the realm of network virtualization. The basis of this discussion will be the complexity tradeoff triad considered in Chapter 1, “Fundamental Concepts”:
• State: The amount of state and the speed at which state in the network changes (particularly the control plane)
• Optimization: The optimal use of network resources, including such things as traffic following the shortest path through the network
• Surface: The number of layers, the depth of their interaction, and the breadth of their interaction
Every virtualization system ever conceived, implemented, and deployed creates shared risk of some sort. For instance, consider a single link that is carrying several virtual links, each of which is carrying traffic. It should be obvious (in fact trivial) to observe that if the single physical link fails, all of the virtual links will fail. Of course, you can simply reroute the virtual links onto another physical link. Right? Maybe or maybe not. Figure 9-11 illustrates.
From the perspective of A and D, there are two links available through B and C, each one providing independent connectivity between the host and the server. The reality is, however, both provider 1 and provider 2 have purchased virtual links through a single link from provider 3. When the single link in provider 3’s network fails, the traffic might be redirected from the path through provider 1 to the path through provider 2, but as both links share the same physical infrastructure, neither link will be able to carry the traffic.
The two links in this situation are said to share fate, because they are part of a Shared Risk Link Group (SRLG). It is possible to find and work around SRLGs, or shared fate situations, but doing so adds complexity to the control plane and/or network management. For instance, there is no way to discover these shared fate situations without either manually testing different failure situations at the physical level or examining network maps to find places where multiple virtual links pass over the same physical link. In the situation described in Figure 9-11, finding the shared fate situation would be almost impossible, as neither provider is likely to tell you it is using a link from a second provider, shown as provider 3 in the illustration, in order to provide service.
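Once a virtual-to-physical mapping is somehow obtained (which, as just noted, is the hard part), finding the SRLGs themselves is simple bookkeeping. The following sketch uses a topology invented to mirror Figure 9-11, where both provider paths ride one provider 3 link:

```python
# Sketch: given a mapping from virtual links to the physical links they
# traverse, report every physical link carried by more than one virtual
# link -- i.e., every shared risk link group. Names are illustrative.

from collections import defaultdict

def find_srlgs(virtual_to_physical: dict) -> dict:
    """Return physical links that carry more than one virtual link."""
    groups = defaultdict(list)
    for vlink, physical_links in virtual_to_physical.items():
        for plink in physical_links:
            groups[plink].append(vlink)
    return {p: v for p, v in groups.items() if len(v) > 1}

topology = {
    "A-D_via_provider1": ["p1-link", "p3-link"],
    "A-D_via_provider2": ["p2-link", "p3-link"],
}
print(find_srlgs(topology))
# {'p3-link': ['A-D_via_provider1', 'A-D_via_provider2']}
```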
Once these shared fate situations are discovered, some action must be taken to avoid a single failure from causing a major network outage. This normally requires either injecting information into the design process, adding complexity to the design, or injecting information into the control plane (see RFC8001 as an example of the type of signaling required to manage SRLGs in a traffic-engineered control plane).
Essentially, the problem comes down to this set of statements:
• Virtualization is a form of abstraction.
• Abstraction removes information about the network state in order to reduce complexity or provide services through the implementation of policy.
• Any nontrivial reduction of information about the network state will reduce the optimal use of resources in some way.
The only counter to the final state of these three is to leak information through the abstraction, so optimal use of resources can be restored—in this case, the failure of a single link not causing a complete failure of traffic flow through the network. The only solution, then, is to make the abstraction a leaky abstraction, reducing the effectiveness of the abstraction at controlling the scope of state and the implementation of policy.
It is common, in network engineering, to overlay two routing protocols, or two control planes, on top of one another. While this is not often considered a form of virtualization, it is, in fact, just that—splitting state between two different control planes to control the amount of state, and the rate at which state changes, to reduce the complexity of both control planes. This is also common when running virtual overlays in a network, as there will be an underlay control plane providing reachability between the tunnel headend and tailend, and an overlay control plane providing reachability within the virtual topology. Two overlaid control planes will interact in sometimes unexpected ways. Figure 9-12 is used to illustrate.
In Figure 9-12:
• Every router in the network, including B, C, D, and E, is running two control planes (or, if it is simpler, routing protocols, hence protocol 1 and protocol 2 in the illustration).
• Protocol 1, the overlay, depends on protocol 2, the underlay, to provide reachability between the routers running protocol 1.
• Protocol 2 does not have any information about connected devices, such as A and F; this information is all carried in protocol 1.
• Protocol 1 requires much longer to converge than protocol 2.
• The lower-cost path from B to E is through C, rather than through D.
Given this set of protocols, assume C, in Figure 9-12, is removed from the network, the two control planes are allowed to converge, and then C is reconnected to the network. What will be the result? The following will occur:
• After C is removed, the network will reconverge with two paths in the local routing table at B:
• F is reachable through E.
• E is reachable through D.
• Once C is reconnected to the network, protocol 2 will converge quickly.
• Once protocol 2 is reconverged, the best path toward E, from the perspective of B, will be through C.
• Therefore, B will now have two routes in the local routing table:
• F is reachable through E.
• E is reachable through C.
• B will shift to the new routing information, and hence will send traffic toward F through C before protocol 1 converges, and hence before C has learned about the best path to F.
• From the time when B starts forwarding traffic destined to F toward C until the time when protocol 1 converges, traffic destined to F will be dropped.
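The sequence above can be reduced to a toy timeline; the convergence times, node names, and return strings below are invented, and the only point is the drop window between the two convergence events.

```python
# Toy model of the overlay/underlay convergence race: the underlay (protocol 2)
# reconverges at time 1, the overlay (protocol 1) at time 10. Between those
# times, B forwards toward C, which has no overlay route to F yet.

def deliver(t, b_next_hop, c_has_route_to_f):
    """Can a packet for F, sent by B at time t, reach F?"""
    if b_next_hop(t) == "C" and not c_has_route_to_f(t):
        return "dropped at C"
    return "delivered"

P2_CONVERGED = 1   # underlay converges quickly after C is reconnected
P1_CONVERGED = 10  # overlay takes much longer

def b_next_hop(t):
    return "C" if t >= P2_CONVERGED else "D"

def c_has_route_to_f(t):
    return t >= P1_CONVERGED

for t in (0, 5, 12):
    print(t, deliver(t, b_next_hop, c_has_route_to_f))
# 0 delivered      (old path via D still in use)
# 5 dropped at C   (the blackhole window)
# 12 delivered     (overlay has converged)
```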
This is a rather simple example of overlaid protocols interacting in an unexpected way. To solve the problem, you need to inject information about the state of the convergence of protocol 1 into protocol 2, or you must somehow force the two protocols to converge at the same time. In either case, you are essentially adding state back into the two protocols to account for their difference in convergence time, as well as creating an interaction surface between the protocols.
Note
This example describes the actual convergence interaction between IS-IS and BGP, or the Open Shortest Path First (OSPF) protocol and BGP. To solve this problem, the faster protocol is configured to wait until BGP has converged before installing any routes in the local routing table.
Network virtualization is an important tool in the hands of the engineer to simplify designs and solve otherwise unsolvable problems. All virtualization solutions require at least two elements to solve the problems virtualization poses:
• Some way to tunnel traffic through a network so traffic can be separated out into a virtual topology
• Some way to discover and advertise reachability across the virtual topology, and some way to draw traffic into the virtual topology
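These two elements can be sketched in a few lines; the field names, prefixes, and tunnel names below are invented. An encapsulation tagged with a topology identifier keeps traffic separated in flight, and a per-topology reachability table draws traffic into the right virtual topology:

```python
# Sketch of the two required elements: tunneling (encapsulation tagged with a
# virtual topology ID) and per-topology reachability. All names are invented.

def encapsulate(inner_packet, underlay_src, underlay_dst, topology_id):
    """Tunnel an overlay packet across the underlay, tagged by topology."""
    return {"outer_src": underlay_src, "outer_dst": underlay_dst,
            "topology": topology_id, "inner": inner_packet}

# The same prefix can exist in two virtual topologies with different exits,
# which is precisely why the lookup must be keyed by topology (VRF-style).
overlay_routes = {
    ("red", "10.1.1.0/24"): "tunnel-to-site-B",
    ("blue", "10.1.1.0/24"): "tunnel-to-site-C",
}

def lookup(topology_id, prefix):
    return overlay_routes[(topology_id, prefix)]

tunnel = lookup("red", "10.1.1.0/24")
pkt = encapsulate({"dst": "10.1.1.5"}, "192.0.2.1", "192.0.2.2", "red")
print(tunnel, pkt["topology"])  # tunnel-to-site-B red
```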
There are a number of interesting, and often unexpected, interaction points between complexity and virtualization that network engineers need to be aware of. All technologies involve tradeoffs of one kind or another, so engineers should be aware of, and intentionally seek out, these tradeoffs when working with virtualization technologies.
The number of virtualization technologies available in the network world almost seems to be without limit sometimes. As network engineers sometimes say: “Please, take my tunneling protocols; there are always plenty to go around.”
Boutros, Sami, Ali Sajassi, Samer Salam, John Drake, and Jorge Rabadan. Virtual Private Wire Service Support in Ethernet VPN. Request for Comments 8214. RFC Editor, 2017. doi:10.17487/RFC8214.
Deering, Dr. Steve E., and Robert M. Hinden. Internet Protocol, Version 6 (IPv6) Specification. Request for Comments 8200. RFC Editor, 2017. doi:10.17487/RFC8200.
Drake, John, Wim Henderickx, Ali Sajassi, Rahul Aggarwal, Dr. Nabil N. Bitar, Aldrin Isaac, and Jim Uttaro. BGP MPLS-Based Ethernet VPN. Request for Comments 7432. RFC Editor, 2015. doi:10.17487/RFC7432.
Farrel, Adrian, Olufemi Komolafe, and Seisho Yasukawa. An Analysis of Scaling Issues in MPLS-TE Core Networks. Request for Comments 5439. RFC Editor, 2009. doi:10.17487/RFC5439.
Filsfils, Clarence, Kris Michielsen, and Ketan Talaulikar. Segment Routing Part I. 1st edition. CreateSpace Independent Publishing Platform, 2017.
Filsfils, Clarence, Stefano Previdi, Ahmed Bashandy, Bruno Decraene, Stephane Litkowski, and Rob Shakir. “Segment Routing with MPLS Data Plane.” Internet-Draft. Internet Engineering Task Force, June 2017. https://datatracker.ietf.org/doc/html/draft-ietf-spring-segment-routing-mpls-10.
Filsfils, Clarence, Stefano Previdi, Bruno Decraene, Stephane Litkowski, and Rob Shakir. “Segment Routing Architecture.” Internet-Draft. Internet Engineering Task Force, June 2017. https://datatracker.ietf.org/doc/html/draft-ietf-spring-segment-routing-12.
Filsfils, Clarence, Stefano Previdi, Bruno Decraene, and Rob Shakir. “Resiliency Use Cases in SPRING Networks.” Internet-Draft. Internet Engineering Task Force, May 2017. https://datatracker.ietf.org/doc/html/draft-ietf-spring-resiliency-use-cases-11.
盖因,吕克·德。MPLS 基础知识。第一版。印第安纳州印第安纳波利斯:思科出版社,2006 年。
Ghein, Luc De. MPLS Fundamentals. 1st edition. Indianapolis, IN: Cisco Press, 2006.
Monge, Antonio Sanchez, and Krzysztof Grzegorz Szarkowicz. MPLS in the SDN Era: Interoperable Scenarios to Make Networks Scale to New Services. 1st edition. Beijing: O’Reilly Media, 2016.
O’Connor, Darren. Day One: MPLS for Enterprise Engineers. Juniper Networks Books, 2014.
O’Dell, Michael D., Joseph Malcolm, Jim McManus, Daniel O. Awduche, and Johnson Agogbua. Requirements for Traffic Engineering Over MPLS. Request for Comments 2702. RFC Editor, 1999. doi:10.17487/RFC2702.
Previdi, Stefano, Clarence Filsfils, Ahmed Bashandy, Hannes Gredler, Stephane Litkowski, Bruno Decraene, and Jeff Tantsura. “IS-IS Extensions for Segment Routing.” Internet-Draft. Internet Engineering Task Force, June 2017. https://datatracker.ietf.org/doc/html/draft-ietf-isis-segment-routing-extensions-13.
Previdi, Stefano, Clarence Filsfils, Paul Mattes, Eric C. Rosen, and Steven Lin. “Advertising Segment Routing Policies in BGP.” Internet-Draft. Internet Engineering Task Force, July 2017. https://datatracker.ietf.org/doc/html/draft-ietf-idr-segment-routing-te-policy-00.
Previdi, Stefano, Clarence Filsfils, Kamran Raza, John Leddy, Brian Field, Daniel Voyer, Daniel Bernier, et al. “IPv6 Segment Routing Header (SRH).” Internet-Draft. Internet Engineering Task Force, July 2017. https://datatracker.ietf.org/doc/html/draft-ietf-6man-segment-routing-header-07.
Psenak, Peter, Shraddha Hegde, Clarence Filsfils, and Arkadiy Gulko. “ISIS Segment Routing Flexible Algorithm.” Internet-Draft. Internet Engineering Task Force, July 2017. https://datatracker.ietf.org/doc/html/draft-hegdeppsenak-isis-sr-flex-algo-00.
Psenak, Peter, Stefano Previdi, Clarence Filsfils, Hannes Gredler, Rob Shakir, Wim Henderickx, and Jeff Tantsura. “OSPF Extensions for Segment Routing.” Internet-Draft. Internet Engineering Task Force, August 2017. https://datatracker.ietf.org/doc/html/draft-ietf-ospf-segment-routing-extensions-19.
———. “OSPFv3 Extensions for Segment Routing.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-ietf-ospf-ospfv3-segment-routing-extensions-09.
Sajassi, Ali, John Drake, Nabil Bitar, Ravi Shekhar, Jim Uttaro, and Wim Henderickx. “A Network Virtualization Overlay Solution Using EVPN.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-overlay-08.
Sivabalan, Siva, Clarence Filsfils, Jeff Tantsura, Wim Henderickx, and Jonathan Hardwick. “PCEP Extensions for Segment Routing.” Internet-Draft. Internet Engineering Task Force, April 2017. https://datatracker.ietf.org/doc/html/draft-ietf-pce-segment-routing-09.
Tappan, Dan, Yakov Rekhter, Alex Conta, Guy Fedorkow, Eric C. Rosen, Dino Farinacci, and Dr. Tony Li. MPLS Label Stack Encoding. Request for Comments 3032. RFC Editor, 2001. doi:10.17487/RFC3032.
Viswanathan, Arun, Eric C. Rosen, and Ross Callon. Multiprotocol Label Switching Architecture. Request for Comments 3031. RFC Editor, 2001. doi:10.17487/RFC3031.
Zhang, Fatai, Oscar Gonzalez de Dios, Matt Hartley, Zafar Ali, and Cyril Margaria. RSVP-TE Extensions for Collecting Shared Risk Link Group (SRLG) Information. Request for Comments 8001. RFC Editor, 2017. doi:10.17487/RFC8001.
1. How is virtualization different from multiplexing?
2. What is the difference between virtual interface and VRF forwarding in a network device (such as a router)?
3. Some overlay control planes include the reachable destinations from both the underlay and the overlay in a single protocol. An example of this would be Ethernet VPNs (eVPNs), in which both the IP reachability of the underlay and the Ethernet reachability of the overlay are carried in a single protocol, the Border Gateway Protocol. How are the overlay and underlay reachability separated?
4. Draw a network where the interaction of the underlay and overlay control planes create a loop.
5. Under what situation would you need to have adjacency SIDs, rather than just node SIDs?
6. Describe another situation where an SRLG in an overlay over an Ethernet network would be impossible to detect but would cause multiple virtual links to fail when a single link fails.
7. Research virtual circuits in Frame Relay. Would you consider this a tunneling mechanism or not? Explain.
1. Previdi et al., “IS-IS Extensions for Segment Routing.”
2. Previdi et al., “Advertising Segment Routing Policies in BGP.”
3. Sivabalan et al., “PCEP Extensions for Segment Routing.”
When you sign in to a financial or medical website, you should expect that the information you retrieve cannot be intercepted and read by anyone along the path between your computer and the server. A less obvious, but just as important, problem is that the information you send to the site should not be open to change while it is being transported by the network.
But how can these things be ensured? These are two of the areas transport security can be used to address. This chapter will consider the transport security problem space, followed by an investigation of several kinds of solutions, including encryption. Finally, this chapter will look at the Transport Layer Security (TLS) specification as an example of transport layer encryption.
Security generally resolves to one of four problems: proving the data has not been changed in transmission, preventing anyone other than the intended recipient from accessing the information, protecting the privacy of the humans using the network, and proving information has been delivered (or work has been done). The second and third problems, preventing unauthorized access to data as it crosses the network and protecting user privacy, are related problems but will be treated separately in the following sections. The final problem noted, the proof of traversal problem (which is similar to the proof of work problem faced in other information technology contexts), is not considered here, as it is an area of active research with few deployed systems.
If you log in to your bank’s website and transfer $100 from one account to another, you would likely be upset if the amount actually transferred was $1,000 instead, or if the account numbers were changed so the $100 ended up in someone else’s account. There are a number of other situations where it is important to make certain the data transmitted is the same as the data received, such as
• If you purchase a pair of blue shoes, you do not want a set of red ones delivered instead.
• If your doctor gives you a prescription for medicine to help your heartburn (probably resulting from the stress of working as a network engineer), you do not want medicine for arthritis (probably from typing so many documents and books) to be delivered.
There are a lot of situations where the data received must match the data transmitted, and the originator and/or receiver must be verifiable.
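One standard mechanism that addresses both requirements is a keyed hash, or HMAC: the receiver can verify both that the message was not modified in transit and that it came from a holder of the shared key. The sketch below uses only the Python standard library; the key and messages are, of course, illustrative.

```python
# Message authentication sketch using an HMAC from the standard library.
# The shared key would be negotiated or provisioned out of band.

import hmac
import hashlib

key = b"shared-secret-between-bank-and-customer"  # illustrative only

def sign(message: bytes) -> bytes:
    """Compute an authentication tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(message), tag)

msg = b"transfer $100 from account 1111 to account 2222"
tag = sign(msg)

print(verify(msg, tag))                                  # True
print(verify(b"transfer $1000 from 1111 to 2222", tag))  # False: tampering detected
```

Note that an HMAC provides integrity and origin authentication, but not confidentiality; keeping the contents private is the separate encryption problem considered next.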
The data protection examples given previously can be taken one step further: you do not want someone to see your account number, prescription, or other information as it is being transported across the network. Account numbers, passwords, and any kind of personally identifiable information (PII) are all very crucial, as these kinds of information can be used to break into accounts to steal money, or even used to steal someone’s identity entirely.
How can this kind of information be protected? The primary means of protection used to prevent unauthorized users (or attackers; see Chapter 21, “Security: A Broader Sweep,” later for a full definition of the elements of an attack) is encryption.
Privacy is not just nice to have on the global Internet; it is a requirement for users to trust the system. This is true of local networks, as well; if users believe they are being spied on in some way, they are not likely to use the network. Rather, they are likely to use sneakernet, printing information out and hand-carrying it, rather than transferring it over the network. While many people believe privacy is not a valid concern, there are many valid concerns in this area.
For instance, a common saying in the information management field is knowledge is power. Knowing about a computer or network gives you some measure of power over the computer, network, or system. For instance, assume a bank configures an automated backup for a particular database table; when the balances in the account held in the table change by a particular amount, the backup is kicked off automatically. This might seem like a perfectly reasonable sort of backup job, but it does involve some amount of data exhaust.
Note
Data exhaust is information about the physical movements of people or information that can be used to infer what those people or that information is doing. For instance, if you always take the same route to work every morning, someone can infer, once you have made some small part of the trip, combined with a time of day, you are going to work. The same sorts of data exhaust exist in the network world; if, every time, at a particular time of day, a particular piece of data of a certain size is transmitted through the network, and it happens to coincide with a particular event, such as transferring money between two accounts, then when this particular data appears, the transfer must be taking place. Browsing, email history, and other online actions all leave data exhaust, which can sometimes be used to infer the contents of a data stream even if the stream is encrypted.
The vulnerability here is: if a threat actor puts the backup together with the change in account value, that person will know specifically what the pattern of account activity is. Enough clues of this sort can be developed into an entire set of attack plans.
The same is true of people; having knowledge about people can give you some ability to influence people in specific directions. While the influence over people is not as great as the influence over machines, handing one person power over another always carries moral implications that need to be handled carefully.
While every solution to the security and privacy issues described in the preceding sections generally involves hard math, this section will (attempt to) describe the solutions without the math. Readers who would like to learn more about the mechanisms considered here are encouraged to look at the “Further Reading” section at the end of the chapter for resources describing specific kinds of encryption algorithms and the math involved.
Encryption takes a block of information (the plaintext) and encodes it using some form of mathematical operation to obscure the text, resulting in the ciphertext. To recover the original plaintext, the mathematical operations must be reversed. While encryption is often approached as a mathematical construct, it is sometimes easier to start by thinking of it as a substitution cipher with a substitution table that varies based on the key used. Figure 10-1 illustrates.
Figure 10-1 shows a four-bit block of information—a trivial example but still useful to illustrate the point. The encryption process is conceptually a series of straight substitutions:
• If 0001 is found in the original block of data (the plaintext) and key 1 is in use, then 1010 is substituted into the actual transmitted stream (the ciphertext).
• If 0010 is found in the plaintext and key 1 is in use, then 0100 is substituted into the transmitted data.
• If 0001 is found in the plaintext and key 2 is in use, then 0000 is substituted into the transmitted data.
• If 0110 is found in the plaintext and key 2 is in use, then 1001 is substituted into the transmitted data.
The process of substituting one block of bits for another is called a transform. These transforms must be symmetrical: they must not only allow the plaintext to be encrypted to the ciphertext, but they must also allow the plaintext to be recovered (unencrypted) from the ciphertext. In a substitution table, this process involves looking up the key on the ciphertext side of the table and substituting the plaintext equivalent.
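This table-lookup view of encryption and decryption can be sketched in a few lines of Python. The tables below are the toy 4-bit substitutions from the bullets above; they are invented purely for illustration and are not drawn from any real cipher.

```python
# Toy substitution cipher over 4-bit blocks. The tables are invented
# for illustration; a real cipher computes the substitution with a
# keyed mathematical function instead of storing a table.
ENC_TABLES = {
    1: {0b0001: 0b1010, 0b0010: 0b0100},  # "key 1" substitutions
    2: {0b0001: 0b0000, 0b0110: 0b1001},  # "key 2" substitutions
}

def encrypt_block(block, key):
    return ENC_TABLES[key][block]

def decrypt_block(block, key):
    # Decryption looks the ciphertext up on the "other side" of the
    # table, i.e., uses the inverted mapping.
    inverse = {ct: pt for pt, ct in ENC_TABLES[key].items()}
    return inverse[block]
```

Note that decryption is just the same table read from the ciphertext side, which is what makes the transform symmetrical.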
The size of the substitution table is determined by the size of the block, or the number of bits encoded at one time. If a 128-bit block is used, for instance, the lookup table would need to have 2^128 entries—a very large number indeed. This kind of space can still be searched quickly by an efficient algorithm, so the block must have some other features than simply being large.
The first is that the ciphertext side of the substitution block must be as random as possible. For a transform to be ideal, any pattern found in the plaintext must not be available for analysis in the resulting ciphertext. The ciphertext output must appear to be as close to a random set of numbers as possible, no matter what the input is.
The second is the substitution block should be as large as is practically possible. The more random and larger the substitution block is, the harder it is to work back from the plaintext and ciphertext to discovering the substitution pattern being used. To perform a brute-force attack against a substitution using a 128-bit block size, the attacker must correlate as many as possible of the 2^128 entries in the plaintext block with the 2^128 entries in the ciphertext substitution block. If the information only uses a small (or sparse) set of possible entries from the original 128-bit space, there is little practical way to make the correlation fast enough for this sort of attack to be practical—given the encrypting sender changes its key often enough.
Note
There is a law of diminishing returns when it comes to the size of the block; at some point, increasing the block size does not increase the effectiveness of the cipher at hiding information.
Density is best explained with an example. Assume you are using a straight substitution cipher in the English language, where each letter is replaced by the letter offset by four steps in the alphabet. In this sort of (trivial) cipher:
• Each A would be replaced by an E.
• Each B would be replaced by an F.
• Each C would be replaced by a G.
• Etc.
Now try encrypting two different sentences using this transform:
• THE SKY IS BLUE == XLI WOC MW FPYI
• THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG == XLI UYMGO FVSAR JSB NYQTIH SZIV XLI PEDC HSK
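This shift-by-four substitution is simple enough to reproduce directly. A minimal Python sketch (uppercase letters only; anything else passes through unchanged):

```python
def shift_cipher(text, shift=4):
    """Replace each letter with the letter `shift` steps later in
    the alphabet, wrapping around from Z back to A."""
    result = []
    for ch in text:
        if ch.isalpha():
            result.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        else:
            result.append(ch)  # spaces pass through unchanged
    return ''.join(result)

print(shift_cipher("THE SKY IS BLUE"))      # XLI WOC MW FPYI
print(shift_cipher("XLI WOC MW FPYI", -4))  # THE SKY IS BLUE
```

Running the same function with a shift of -4 reverses the substitution, which is the symmetry requirement described earlier.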
For the attacker trying to figure out how the ciphertext version of the sentence relates to the plaintext version, the first sentence presents 9 matching pairs of letters out of the space of 26 possible letters. There is a good chance you can guess what the correct transform is—move four steps to the right—from this small sample, but it is possible there is some “trick” involved that causes future messages encrypted using this transform to fail to be unencrypted correctly. The second sentence is, however, a well-known example of a sentence containing every possible letter in the English alphabet. The transform can be validated against every possible value in the entire input and output range, making the discovery of the transform trivial.
In this example, the first sentence would be less dense than the second. In real cryptographic systems, the general idea would be to use just several thousand possible symbols out of a space of 2^128 or 2^512 possible symbols, which creates a much less dense information set to work with. At some point, the density becomes low enough, the transform complex enough, and the ciphertext random enough, that there is no practical way to compute the relationship between the input (the plaintext) and the output (the ciphertext).
In real life, the substitution blocks are not precomputed in this way. Rather, a cryptographic function is used to calculate the substitution value in real time. These cryptographic functions take a block-sized input, the plaintext, perform the transform, and output the correct ciphertext. The key is a second input that modifies the output of the transform so each key causes the transform to produce a different output. If the key size is 128 bits, and the block size is 256 bits, there are 2^128 × 2^256 possible output combinations from the transform. Figure 10-2 illustrates.
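The shape of such a function, though none of its real complexity, can be shown with a deliberately trivial keyed transform. The XOR used here is an illustrative stand-in; real ciphers such as AES apply many highly nonlinear operations, but the interface is the same: a plaintext block and a key go in, and a ciphertext block comes out.

```python
def toy_transform(block, key):
    """A deliberately trivial keyed transform: XOR the block with
    the key. Each distinct key selects a different effective
    substitution table, computed on the fly rather than stored."""
    return block ^ key

def toy_untransform(block, key):
    # XOR is its own inverse, so the same operation decrypts.
    return block ^ key
```

Changing the key changes every substitution the function produces, which is exactly the "one table per key" behavior described above.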
In Figure 10-2, each substitution table is the block size; if the block size is 256 bits, then there are 2^256 possible substitutions in each table. Each key generates a new table, so if the key is 128 bits, then there are 2^128 possible tables. There are two general ways to attack such an encryption system.
The first way to attack this type of encryption system is to try to map every possible input value to every possible output value, revealing the entire substitution table. If the input only ever represents a small set of the possible inputs (the table is sparsely used, or is a sparse array, more precisely), this task is nearly impossible. If the user changes her key, and hence the particular table among the possible set of tables, often enough, there is no way to perform this mapping faster than the key is changed.
There are still potential weaknesses even in large blocks combined with transforms to produce nearly random output—in other words, even if the transform is close to ideal. If you collect 23 people in a single room, there is a high probability two of them will have the same birthday—but this seems irrational because there are 365 potential days (not counting leap years) on which a person could be born each year. The reason for the disparity between what appears should happen and what does happen is this: in the real world, people’s birthdays are clustered on a very small number of days throughout the year. The input data, then, is a very dense “spot” in a moderately large set of possible values. When this happens, the sparseness of the data can work against the encryption system. If a small set of data is repeated in the larger set on a regular basis, the attacker can focus on just the substitutions used most often and potentially discover the contents of enough of the message to make recovery of most of the meaning reasonably possible.
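The birthday figure is easy to check. Assuming (unrealistically) that birthdays are uniform over 365 days, the probability that at least two of 23 people share one is just over 50 percent:

```python
def birthday_collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday,
    assuming birthdays are uniformly distributed over `days`."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(birthday_collision_probability(23))  # about 0.507
```

Real birthdays are clustered rather than uniform, which pushes the collision probability even higher; that clustering is the same density effect an attacker exploits in a repeated, dense input space.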
The second way to attack an encryption system of this kind is to attack the transform itself—the cryptographic function. Remember these large substitution tables are often impossible to generate, store, and transport, so some form of cryptographic function is used to take a block of plaintext as an input and generate a block of ciphertext as the output. If you could discover this transform function, then you could calculate the output in the same way the transmitter and receiver do, and recover the plaintext in real time.
In the real world, this problem is made more complex by
• Kerckhoffs’ principle, which states the transform itself must not be a secret. Rather only the key used to select which table among the possible tables should be kept secret.
• At least some plaintext and ciphertext can sometimes be recovered from an ongoing encrypted data transmission for various reasons—perhaps a mistake, or perhaps the point of the encryption is to verify the text, rather than keeping the text from being read.
Given these restrictions, there are several key points to consider:
• The difficulty of computing the key from the plaintext, ciphertext, and cryptographic function (transform) must be very high.
• The randomness of the output of the cryptographic function must be very high, to reduce the possibility of brute-force attacks—just trying every possible key in the space—being successful.
• The key space must be large, again to prevent brute-force attacks from being successful.
The quality of a cryptographic function is determined by its ability to produce output as close to random as possible from virtually any input, so that an attacker is prevented from discovering which key is being used, even though they have both the plaintext and the ciphertext. Cryptographic functions, then, are normally built on some form of computationally hard problem. One in particular that is often used is finding the prime factors of very large numbers.
What happens if you are using a 128-bit block, and you have 56 bits of data to transport? The most natural thing to do in this situation would be to pad the plaintext with some number; most likely all 0s or all 1s. The quality of the output is dependent, to some degree, on the sparseness of the input; the smaller the range of numbers used as input, the more predictable the output of the cryptographic function will be. In this case, it is important to use padding that is as close to random as possible; there is an entire field of study around how to pad blocks of plaintext to "help" the cryptographic function produce ciphertext that is as close to random as possible.
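A minimal sketch of random padding follows. The scheme here (random filler bytes, with the final byte recording the pad length so it can be stripped after decryption) is an illustrative assumption in the spirit of ISO 10126-style padding, not a recommendation for production use:

```python
import os

def pad(data, block_size=16):
    """Pad `data` to a multiple of block_size using random filler
    bytes; the last byte records how many bytes of padding were
    added, so the padding can be removed after decryption."""
    pad_len = block_size - (len(data) % block_size)
    return data + os.urandom(pad_len - 1) + bytes([pad_len])

def unpad(padded):
    # The final byte says how many trailing bytes are padding.
    return padded[:-padded[-1]]
```

Because the filler bytes are random rather than all zeros, two paddings of the same short message produce different plaintext blocks, which helps keep the cipher's input from being predictably sparse.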
It is possible to process the same information through a cryptographic function multiple times. For instance, if you have a 128-bit block and a 128-bit key, you can
• Take the plaintext and, using the key, calculate a ciphertext; call this ct1.
• Take ct1 and, using the key, calculate a second-round ciphertext; call this ct2.
• Take ct2 and, using the key, calculate a third-round ciphertext; call this ct3.
The actual transmitted ciphertext would be the final ct3. What does this process accomplish? Remember the quality of the encryption process is related to the randomness of the output against the input. Each round will, in many situations, increase the randomness just a bit more. There is a point of diminishing returns in this process; normally after the third round, the data is not going to become “any more random,” and hence more rounds are essentially just wasting processing power and time for very little gain.
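The three-round process can be sketched generically. The round function below (a keyed XOR followed by a bit rotation) is invented for illustration and has none of the strength of a real cipher round:

```python
def round_fn(block, key):
    """Stand-in round function: XOR with the key, then rotate the
    16-bit result left by 3 bits. Illustrative only; real cipher
    rounds are far more complex and nonlinear."""
    x = (block ^ key) & 0xFFFF
    return ((x << 3) | (x >> 13)) & 0xFFFF

def encrypt(block, key, rounds=3):
    ct = block
    for _ in range(rounds):  # produces ct1, ct2, ct3 in turn
        ct = round_fn(ct, key)
    return ct
```

Only the final value is transmitted; the intermediate ct1 and ct2 exist just as steps in the computation.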
There is a class of cryptographic functions that can transform the plaintext into ciphertext, and back, using two different keys. This capability is useful when you want to be able to encrypt a block of data with one key and allow someone else to unencrypt the data using a different key. The key you keep secret is called the private key, and the key you give to others, or publish, is called the public key.
To prove you are the actual sender of a particular file, for instance, you can encrypt the file with your private key. Now anyone with your public key can unencrypt the file, which could only have been sent by you. You would not normally encrypt the entire block of data with your private key (in fact most systems using key pairs are designed so you cannot do this); rather a signature is created using your private key that can be verified using your public key. To ensure only the person you want to read something can, you can encrypt some data with her public key, publish it, and only the person with the correct private key can unencrypt it.
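Textbook RSA with tiny, insecure parameters shows the asymmetry at work. The values below (p = 61, q = 53) are standard classroom numbers; real keys are thousands of bits long, and real systems sign a hash of the data rather than the data itself.

```python
# Textbook RSA with deliberately tiny, insecure parameters.
p, q = 61, 53
n = p * q              # 3233, the public modulus
e = 17                 # public exponent: (e, n) is the public key
d = 2753               # private exponent: kept secret

def encrypt_with_public(m):
    return pow(m, e, n)    # anyone holding the public key can do this

def sign_with_private(m):
    return pow(m, d, n)    # only the private-key holder can do this

def verify_with_public(sig):
    return pow(sig, e, n)  # recovering m proves the signer held d
```

Encrypting with the public key and decrypting with the private key keeps data confidential; applying the keys in the opposite order produces a signature anyone can verify.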
Such systems are called public key cryptography (sometimes the names engineers choose are, perhaps, a little too obvious), or asymmetric cryptography. In public key cryptography, the public key is often “released into the wild;” it is something anyone with access to a key server or some other source can look up.
The alternative to public key cryptography is symmetric key cryptography. In symmetric key cryptography, the sender and receiver share a single key that is used to both encrypt and unencrypt the data (the shared secret). Given shared secrets are (obviously) difficult to create and use, why is symmetric key cryptography ever used? There are two basic tradeoffs to consider when choosing between symmetric and public/private key cryptography:
• Processing complexity: Public key cryptography systems generally require a good deal more processing power to encrypt and unencrypt the transmitted data. Symmetric key systems are generally much easier to develop and deploy in a way that does not require large amounts of processing power and time. Because of this, public key cryptography is often used to encrypt very small amounts of data, such as a private key (see the example in the following section).
• Security: Public key cryptography generally requires a somewhat unique set of mathematical transform mechanisms. Symmetrically keyed systems tend to have a wider range of available transforms that are also more complex and hence more secure (they provide more randomness in the output and hence are harder to break).
There is a place for both kinds of systems, given these tradeoffs and real-world requirements.
Some of the earliest cryptographic systems involved wrapping paper around a cylinder of a specific size; the cylinder had to be somehow carried between the two parties to the encrypted communication without being captured by an enemy. In more recent years, pads of keys were physically carried between the two end points of an encrypted system. Some of these were arranged so a particular page would be used for a certain time period and then ripped out, securely destroyed, and replaced by a new page for the next day. Others were designed so each page in the pad would be used to encrypt one message, at which point the page would be ripped out and replaced—a one-time pad.
Note
The concept of a one-time pad has been carried into the modern world with authentication systems that allow the user to create a code that is used once, and then discarded, to be replaced by a new code the next time the user tries to authenticate. Any system that relies on a code that is used once is still called a one-time pad.
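The electronic form of a one-time pad is a short sketch: XOR the message with a truly random key of the same length, use the key exactly once, and destroy it afterward.

```python
import os

def one_time_pad(message):
    """Encrypt `message` with a fresh random pad of the same length.
    The pad must be used for exactly one message, then destroyed."""
    pad = os.urandom(len(message))
    ciphertext = bytes(m ^ k for m, k in zip(message, pad))
    return pad, ciphertext

def one_time_unpad(ciphertext, pad):
    return bytes(c ^ k for c, k in zip(ciphertext, pad))
```

The security of this construction rests entirely on the pad being truly random, as long as the message, and never reused.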
In the modern world, there are other ways you can exchange cryptographic material, whether it is using a shared secret key or retrieving a private key.
Many times in cryptography, it is easier to explain how something works using trivial examples. In the following explanations, Fish and Jeff will be two users who are trying to exchange secure information, with Fish being the initiator and sender, and Jeff being the receiver.
Fish would like to send a message to Jeff in a way that only Jeff can read it; to do this, she needs Jeff’s public key (remember she should not have access to Jeff’s private key). Where can she get this information? She could
• Ask Jeff for it directly. This might seem simple to do, but it could be very difficult in real life. How, for instance, can she be certain she is actually communicating with Jeff?
• Look up Jeff’s public key in a public database of keys (a key server). Again, this seems to be straightforward, but how does she know she has found the right person, or someone has not placed a false key for Jeff on this particular server?
These two problems can be solved through some sort of reputation system. For instance, in the case of a public key, Jeff could ask several of his friends, who know him well, to sign his public key using their private keys. Their signature on his public key essentially says, “I know Jeff, and I know this is his public key.” Fish can examine this list of friends to determine if there are any of them she can trust. Based on this examination, Fish can determine she either trusts that this specific key is Jeff’s key, or she does not.
In this situation, it is up to Fish to determine how much, and what sort of, proof she will accept. Should she, for instance, accept that the key she has is actually Jeff’s because
• She directly knows one of Jeff’s friends and trusts this third person to tell her the truth.
• She knows someone who knows one of Jeff’s friends, and trusts this friend of hers to tell her the truth about Jeff’s friend, and hence trusts Jeff’s friend to tell the truth about Jeff and his key.
• She knows several people who know several of Jeff’s friends and makes a decision to trust this is Jeff’s key based on the testimony of several people.
This kind of system is called a web of trust. The general idea is that trust has different levels of transitivity. The concept of transitive trust is somewhat controversial, but the idea behind a web of trust is if you receive enough evidence, you can build up a trust in a person/key pairing. An example of this kind of web of trust is the Pretty Good Privacy ecosystem, where people meet at conferences to cross sign one another’s keys, building up a web of transitive trust relationships that can be relied on when their communication moves into the electronic-only realm.
Another option is the key server owner could somehow do an investigation of Jeff and determine if he really is who he says he is, and whether or not this is really his key. The clearest “real-world” example of this sort of solution is a public notary. If you sign a document in front of a notary, he checks for some form of identification (verifying who you are) and then watches you physically sign the document (verifying your key).
This kind of validation is called a central source of trust (or similar—though it almost always has the word centralized in it) or a Public Key Infrastructure (PKI). The solution depends on Fish trusting the process and honesty of the centralized key repository.
Given symmetric key cryptography is so much faster to process than public key cryptography, you would ideally like to encrypt any long-standing or high-volume flows using a symmetric shared secret key. But, short of somehow physically exchanging keys, how is it possible to exchange a single private key between two devices that are connected over a network? Figure 10-3 is used to illustrate.
In Figure 10-3:
1. Assume A begins the process. A will encrypt a nonce, a random number used once in the process and then thrown away (a nonce is a form of a one-time pad, in effect), using B’s public key. Because the nonce has been encrypted with B’s public key, in theory only B can unencrypt the nonce, as only B should know B’s private key.
2. B, on unencrypting the nonce, will now send some new nonce to A. This may include A’s original nonce, or A’s original nonce plus some other information. The point is that A must know, for certain, that the original message including A’s nonce was received by B—and not some other system acting as B. This is ensured by B including some piece of information that was encrypted using its public key, as B is the only system that could have unencrypted it.
3. A and B, using the nonces and other information exchanged to this point, will calculate a private key, which is then used to encrypt/unencrypt information transferred between the two systems.
The steps outlined here are somewhat naïve; there are better, more secure systems, such as the Internet Key Exchange (IKE) protocol; see the “Further Reading” section at the end of the chapter for resources in this area.
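Even this naive exchange can be sketched end to end. The textbook RSA numbers and the hash-based key derivation below are illustrative assumptions standing in for a real public-key operation and a real key derivation function:

```python
import os
import hashlib

# B's toy RSA key pair (textbook classroom values, far too small to
# be secure; they stand in for a real public/private key pair).
n, e, d = 3233, 17, 2753

# Step 1: A picks a nonce and encrypts it with B's public key.
nonce_a = int.from_bytes(os.urandom(1), 'big')
sealed = pow(nonce_a, e, n)

# Step 2: B recovers A's nonce with its private key and replies with
# a nonce of its own (the reply would also echo nonce_a in some form
# so A knows it is really talking to B).
recovered = pow(sealed, d, n)
nonce_b = int.from_bytes(os.urandom(1), 'big')

# Step 3: both sides derive the same symmetric key from the nonces.
def derive_key(na, nb):
    return hashlib.sha256(f"{na}:{nb}".encode()).digest()

key_at_a = derive_key(nonce_a, nonce_b)
key_at_b = derive_key(recovered, nonce_b)  # identical on both sides
```

The derived symmetric key can then protect the bulk of the traffic, which is exactly the division of labor between public key and symmetric cryptography described above.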
Assume you wanted to send a large text file, or even an image, and allow receivers to validate it originated from you. What if the data in question is very large? Or what if the data needs to be compressed to be transmitted effectively? There is a natural conflict between cryptographic algorithms and compression; cryptographic algorithms attempt to produce maximally random output, and compression algorithms attempt to take advantage of nonrandomness in the data to compress data into a smaller number of bits. Or perhaps you want the information to be read by anyone who would like to read it, which means not encrypting it, but you would like receivers to be able to verify you transmitted it if they would like to.
Cryptographic hashes are designed to provide a solution to these problems. There is a brief explanation of hashes in Chapter 7, "Packet Switching." You might have already noticed at least one similarity between the idea of a hash and a cryptographic algorithm. Specifically, a hash is designed to take a very large piece of data and create a fixed-length representation with very few collisions in the output across a wide range of inputs. This is very similar to the requirement that a cryptographic algorithm produce output as close to random as possible for any input. Another similarity worth mentioning is that hash and cryptographic algorithms both work better with a very sparsely populated input space.
A cryptographic hash simply replaces the normal hash function with a cryptographic function. In this case, the hash can be calculated and either posted alongside the data or transmitted with the data.
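With Python's standard library this is a one-line operation; SHA-256 is a commonly used cryptographic hash:

```python
import hashlib

def digest_for_publication(data):
    """Return a fixed-length SHA-256 digest (64 hex characters) of
    arbitrarily large data. Posting the digest alongside the data
    lets receivers verify the data has not been altered."""
    return hashlib.sha256(data).hexdigest()
```

Any change to the input, however small, produces a completely different digest, so the digest serves as a compact, verifiable fingerprint of the data.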
Cryptographic hashes can either be used with symmetric or public key systems, but they are normally used with public key systems.
Returning to the chapter introduction, another security problem space is data exhaust. In the case of individual users, data exhaust can be used to trace what users are doing while they are on the network (rather than just processes). For instance:
• If you carry a cell phone with you at all times, it is possible to trace the movement of the Media Access Control (MAC) address as it moves between wireless connection points to trace your physical movements.
• Since most data streams are not symmetrical—data passes through large packets, while acknowledgments are passed through small packets—an observer can discover when you are uploading and downloading data, and perhaps even when you are completing small transactions. Combined with the destination server, this information could reveal a good bit about your behavior as a user in a particular situation, or over time. This, and many other kinds of traffic analysis, can be performed even on encrypted traffic.
• As you move from website to website, an observer can trace how long you spend on each one, what you click on, how you reached the next site, what you have searched for, what sites you keep open at any time, etc. This information can reveal a good bit about you as a person, what you are trying to accomplish, and other personal factors.
Two solutions of interest in this space are covered in the following sections as examples of the sorts of solutions available: MAC address randomization and onion routing.
The Institute of Electrical and Electronic Engineers (IEEE) originally designed the MAC-48 address space, described in Chapter 4, “Lower Layer Transports,” to be assigned by manufacturers of the network interfaces. These addresses would then be used “as is” by manufacturers of networking equipment, so each piece of hardware would have a fixed, immutable hardware address. This process was designed long before cell phones were even a dream on the horizon and before privacy became an issue.
In the modern world, this means a single device can be followed regardless of where it connects to the network. Many users find this unacceptable, particularly as it is not just the provider who can track this information, but anyone who can listen in on the wireless signal, which means anyone with an antenna. One way to solve this is to allow the device to change its MAC address on a regular basis, even perhaps using a different MAC address in each packet. Since a third party listener, outside the provider network, cannot “guess” the next MAC address any device will use, it cannot track a particular device. A device that uses MAC address randomization will also use a different MAC address on each network it joins, so it will not be trackable across multiple networks.
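The mechanics can be sketched in a few lines of Python. This is illustrative only (real randomization is implemented in the Wi-Fi driver or operating system), but it shows the two address bits any randomized MAC must respect: the locally administered bit set, and the multicast bit clear.

```python
import random

def random_mac() -> str:
    """Generate a random, locally administered, unicast MAC address.

    Randomized MACs set the 'locally administered' bit (0x02) and clear
    the 'multicast' bit (0x01) in the first octet, so the result cannot
    collide with a manufacturer-assigned (universally administered) MAC.
    """
    first = (random.getrandbits(8) | 0x02) & ~0x01  # set local bit, clear multicast bit
    rest = [random.getrandbits(8) for _ in range(5)]
    return ":".join(f"{octet:02x}" for octet in [first] + rest)

mac = random_mac()
```

A device following this scheme can draw a fresh address each time it joins a network (or more often), so an outside observer has no stable identifier to correlate.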
There are attacks against MAC address randomization, primarily centering around the user’s authentication to use the network. Most authentication systems rely on the MAC address, because it is programmed into the device, to identify the device, and in turn, the user. Once the MAC address is no longer an unchanging identifier, there must be some other solution. Places where MAC address randomization can be attacked are
• Timing: If a device is going to change its MAC address, it must somehow tell the other end of the wireless link about these changes, so the channel between the connected device and the base station can remain viable. There must be some agreed-on system of timing so the changing MAC address can continue communicating across the change. If an attacker can determine when this change will take place, then she can watch the right window of time and discover the new MAC address the device takes on.
• Sequence numbers: As with all transport systems, there must be some way to determine if all the packets have been received or dropped. An attacker can track the sequence numbers being used to track packet delivery and acknowledgment. Combined with the timing attack just noted, this can provide fairly certain identification of a specific device across MAC address changes.
• Information element fingerprints: Each mobile device has a set of capabilities it can support, such as installed browsers, extensions, apps, and additional hardware. Because each user is unique, the set of applications he uses will also likely be fairly unique, creating a fingerprint of capabilities that will be reported through the information element in response to probes from the base station.
• Service Set Identifier (SSID) fingerprints: Each device keeps a list of networks it can currently reach and (potentially) networks it could reach at some point in the past. This list is likely to be fairly unique, and hence can act as a device identifier.
While each of these items may provide some level of uniqueness at a device level, the combination of these items can come very close to identifying a specific device often enough to be practically useful in tracking any specific user connecting to a wireless network.
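The intuition behind combining these identifiers is easy to demonstrate: attributes that are individually weak become strong together. The sketch below is purely illustrative (the attribute values are invented), but it shows how an observer might fold information elements and an SSID list into a single stable fingerprint that survives MAC address changes.

```python
import hashlib

def device_fingerprint(info_elements, ssid_list) -> str:
    """Combine several weak identifiers into one stronger fingerprint.

    Neither list uniquely identifies a device on its own, but hashing the
    sorted combination often does, regardless of the current MAC address.
    Sorting makes the fingerprint insensitive to the order attributes are
    observed in.
    """
    material = "|".join(sorted(info_elements) + sorted(ssid_list))
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```

Two observations of the same device yield the same fingerprint even if every packet carries a different randomized MAC, which is exactly why the combination of these items is useful to a tracker.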
This does not mean MAC address randomization is useless, but rather this is one step in preserving user privacy when connected to a wireless network.
Onion routing is a mechanism used to disguise the path of, as well as encrypt, user traffic passing through a network. Figure 10-4 is used to illustrate.
In Figure 10-4, host A wants to send some traffic to K securely, without any other node in the network being able to see the connection between the host and the server, and without any observer being able to see the plaintext. To accomplish this with onion routing, A does the following:
1. It uses a service to find a set of nodes that can interconnect and provide a path to the server, K. Assume this set of nodes includes [B,D,G]; while the illustration shows these as routers, they are more likely software routers running on hosts, rather than dedicated network devices. Host A will first find B’s public key and use this information to build a symmetric key encrypted session with B.
2. Once this session is established, A will then find D’s public key, and use this information to exchange a set of symmetric keys with D, finally building a session to D using this symmetric secret key to encrypt the secured channel. It is important to note that from D’s perspective, this session is with B, rather than A; host A simply instructs B to take these actions on its behalf, rather than doing them directly. This means that D does not know A is the originator of the traffic; it only knows the traffic is sourced from B and carried across an encrypted link from there.
3. Once this session is established, A will then instruct D to set up a session with G in the same way it instructed B to set up a session with D. D now knows the destination is G but does not know where the traffic will be routed by G.
Host A now has a secure path to K with the following properties:
• The traffic between each pair of nodes along the path is encrypted with a different symmetric private key. An attacker that breaks the connection between one pair of nodes along the path still cannot observe the traffic being transmitted between nodes elsewhere in the path.
• The exit node, which is G, knows the destination but not the source of the traffic.
• The entrance node, which is B, knows the source of the traffic but not the destination.
In this kind of network, only A knows the full path between itself and the destination. The intermediate nodes do not even know how many nodes are in the path—they know about the previous and next nodes. The primary form of attack against such a system is to take over as many exit nodes as you can, so you can observe the traffic exiting from the entire network, and correlate it back into a full stream of information.
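The layering described above can be sketched as nested encryption: A wraps the payload once per relay, and each relay peels exactly one layer. The XOR "cipher" below is a toy stand-in for the real symmetric ciphers an onion-routing deployment would use; the point is only the nesting and peeling order.

```python
import hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream 'cipher' standing in for real symmetric encryption."""
    stream = hashlib.sha256(key).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR is its own inverse

# A shares one symmetric key with each relay: B (entry), D (middle), G (exit).
keys = {"B": b"key-for-B", "D": b"key-for-D", "G": b"key-for-G"}
message = b"payload for server K"

# Build the onion innermost layer first, so B peels first and G peels last.
onion = message
for hop in ["G", "D", "B"]:
    onion = toy_encrypt(keys[hop], onion)

# Each relay strips exactly one layer; only the exit node G sees the payload.
for hop in ["B", "D", "G"]:
    onion = toy_decrypt(keys[hop], onion)

assert onion == message
```

Note that after B peels its layer, the payload is still covered by D's and G's layers, which is why the entry node learns the source but never the content or destination.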
Transport Layer Security (TLS), the successor to the older Secure Sockets Layer (SSL), is a secure transport layer protocol deployed by default in most web browsers. When users see the small green lock indicating that a website is "safe," this means the site's certificate is valid, and the traffic between the host (on which the browser runs) and the server (on which the web server runs) is being encrypted. TLS is a complex protocol with a lot of different options; this section will provide a rough overview of its operation. Figure 10-6 illustrates the components of the TLS suite.
In Figure 10-6:
• The handshake protocol is responsible for initializing sessions and setting up session parameters, including the initial private key exchange.
• The alert protocol is responsible for error handling.
• The change cipher specification is responsible for starting the encryption.
• The record protocol breaks data blocks presented for transport into fragments, (optionally) compresses the data, adds a Message Authentication Code (MAC), encrypts the data using the symmetrical key, adds the original information to the block, and then sends the block to the Transmission Control Protocol (TCP) for transport across the network.
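The record protocol's per-fragment processing order can be sketched as follows. This is a simplification: compression and the final encryption step are skipped, and the fragment size is arbitrary; the point is the fragment-then-MAC structure of the records handed down to TCP.

```python
import hashlib
import hmac

def record_protocol_send(data: bytes, mac_key: bytes, fragment_size: int = 16):
    """Sketch of the record protocol's per-fragment pipeline:
    fragment -> (compress, skipped here) -> append MAC -> (encrypt, skipped).

    Returns the list of protected records that would be handed to TCP.
    """
    records = []
    for i in range(0, len(data), fragment_size):
        fragment = data[i:i + fragment_size]
        mac = hmac.new(mac_key, fragment, hashlib.sha256).digest()
        records.append(fragment + mac)  # real TLS would encrypt this block
    return records
```

Because the MAC is computed over each fragment before encryption, the receiver can both decrypt and verify the integrity and origin of every record independently.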
Applications running on top of TLS use a special port number to access the service through TLS. For instance, web services using the Hypertext Transfer Protocol (HTTP) are normally accessible over TCP port 80; TLS-encrypted HTTP is normally accessible through port 443. While the service is the same, the different port number allows the receiving stack to direct the traffic through TLS for decryption before the final application reads it.
The MAC, which within this context will mean a Message Authentication Code, is used to ensure the sender is authenticated. While some cryptography systems assume that successfully encrypting data with a key the receiver knows proves the sender is truly who he claims to be, TLS does not. Instead, TLS includes a MAC that validates the sender separately from the keys used to encrypt messages on the wire. This helps prevent MitM attacks against TLS-encrypted data streams.
Figure 10-7 shows the TLS startup handshake, which is managed by the handshake protocol.
In Figure 10-7:
1. The client hello is sent in plaintext, and contains information about the version of TLS the client is running, 32 random octets (the nonce), a session identifier (which allows a previous session to be recovered or restored), a list of the encryption algorithms (cipher suites) the client supports, and a list of the data compression algorithms the client supports.
2. The server hello is sent in plaintext, as well, and contains the same information as above, from the server’s perspective. In the server hello, the encryption algorithm field indicates the kind of encryption that will be used for this session. This is normally the “best” encryption algorithm available at both the client and the server (although it is not always the “best”).
3. The server sends its public key (a certificate), along with the nonce that the client sent to the server, where the nonce is now encrypted using the server’s private key.
4. The server hello done message indicates the client now has the information it needs to complete the session setup.
5. The client generates a private key and uses the server’s public key to encrypt it. This is transmitted in the client key exchange message toward the server.
6. Once this has been transmitted, the client must sign something known to both the server and the client in order to verify the sender is the correct device. Usually, the signature is across all the messages in the exchange up to this point; generally, a cryptographic hash is used to generate a verification.
7. The change cipher specification message essentially acknowledges the session is up and running.
8. The finished message once again authenticates all the previous handshake messages to this point.
9. The server then acknowledges the encryption session is set up by sending a change cipher specification message.
10. The server then sends a finished message, which authenticates the prior messages sent in the handshake in the same way as above.
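The essential outcome of this exchange, both ends independently deriving the same session key from the two nonces and the client's secret, can be sketched without any real public key machinery. The derivation function below is invented for illustration; real TLS uses a defined pseudorandom function and encrypts the premaster secret under the server's public key rather than handing it over directly.

```python
import hashlib
import secrets

# From the client hello and server hello (steps 1 and 2): two public nonces.
client_nonce = secrets.token_bytes(32)
server_nonce = secrets.token_bytes(32)

# Step 5: the client invents the premaster secret. In real TLS it travels
# encrypted under the server's public key; here it is simply shared.
premaster = secrets.token_bytes(48)

def derive_session_key(pre: bytes, c_nonce: bytes, s_nonce: bytes) -> bytes:
    """Both sides run the same derivation over the same three inputs."""
    return hashlib.sha256(pre + c_nonce + s_nonce).digest()

client_key = derive_session_key(premaster, client_nonce, server_nonce)
server_key = derive_session_key(premaster, client_nonce, server_nonce)
assert client_key == server_key  # both ends now share a symmetric key
```

Mixing the nonces into the derivation is what makes each session's key fresh: even if the same premaster secret were somehow reused, different nonces would produce a different session key.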
Note
Optional steps in the TLS handshake have been left out of this explanation for clarity.
Once the session is up and running, applications can send information toward the receiving host on the correct port number. This data will be encrypted using the previously negotiated private key and then handed off to TCP for delivery.
This chapter has considered three specific problems in the space of transport security: validating data, protecting data from being examined, and protecting user privacy. For network engineers, understanding the theory of how transport security works and where the weak spots in a transport security system interact with the network design is often more important than understanding the intimate details of the actual security mechanisms themselves. Because of this, this chapter has focused on providing a stronger theoretical foundation in the form of "how to think about transport security," rather than on practical implementations of transport security. Readers who are interested in a deeper exploration of transport security are encouraged to look at the "Further Reading" section at the end of this chapter.
Overall, transport security is just one small piece of the overall security required in network engineering; Chapter 21, “Security: A Broader Sweep,” considers a broader sweep of security topics at both a network and a system level.
Bauer, Kevin, Damon McCoy, Dirk Grunwald, Tadayoshi Kohno, and Douglas Sicker. “Low-Resource Routing Attacks Against Tor.” In Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society, 11–20. WPES ’07. New York, NY, USA: ACM, 2007. doi:10.1145/1314333.1314336.
Brockners, Frank, Shwetha Bhandari, Sashank Dara, Carlos Pignataro, John Leddy, Stephen Youell, David Mozes, and Tal Mizrahi. “Proof of Transit.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-brockners-proof-of-transit-03.
Davies, Joshua. Implementing SSL / TLS Using Cryptography and PKI. 1st edition. Hoboken, NJ: Wiley, 2011.
Ducklin, Paul. “What Your Encrypted Data Says about You.” Naked Security, March 18, 2016. https://nakedsecurity.sophos.com/2016/03/18/what-your-encrypted-data-says-about-you/.
Ferguson, Niels, and Bruce Schneier. Practical Cryptography. 1st edition. New York: Wiley, 2003.
Ferguson, Niels, Bruce Schneier, and Tadayoshi Kohno. Cryptography Engineering: Design Principles and Practical Applications. 1st edition. Indianapolis, IN: Wiley, 2010.
Katz, Jonathan, and Yehuda Lindell. Introduction to Modern Cryptography. 2nd edition. Boca Raton, FL: Chapman and Hall/CRC, 2014.
Kaufman, Charlie, Paul E. Hoffman, Yoav Nir, Pasi Eronen, and Tero Kivinen. Internet Key Exchange Protocol Version 2 (IKEv2). Request for Comments 7296. RFC Editor, 2014. doi:10.17487/RFC7296.
Matte, Célestin, Mathieu Cunche, Franck Rousseau, and Mathy Vanhoef. “Defeating MAC Address Randomization Through Timing Attacks.” In Proceedings of the 9th ACM Conference on Security & Privacy in Wireless and Mobile Networks, 15–20. WiSec ’16. New York, NY: ACM, 2016. doi:10.1145/2939918.2939930.
Narayanan, Arvind, Joseph Bonneau, Edward Felten, Andrew Miller, and Steven Goldfeder. Bitcoin and Cryptocurrency Technologies: A Comprehensive Introduction. Princeton, NJ: Princeton University Press, 2016.
Paar, Christof, Jan Pelzl, and Bart Preneel. Understanding Cryptography: A Textbook for Students and Practitioners. 1st edition. Heidelberg; New York: Springer, 2010.
Piper, Fred, and Sean Murphy. Cryptography: A Very Short Introduction. 1st edition. Oxford; New York: Oxford University Press, 2002.
Rescorla, Eric, and Tim Dierks. The Transport Layer Security (TLS) Protocol Version 1.2. Request for Comments 5246. RFC Editor, 2008. doi:10.17487/RFC5246.
Schneier, Bruce. Applied Cryptography: Protocols, Algorithms and Source Code in C. 1st edition. Indianapolis, IN: Wiley, 2015.
Shimeall, Tim. “Traffic Analysis for Network Security: Two Approaches for Going Beyond Network Flow Data.” SEI Blog, September 16, 2016. https://insights.sei.cmu.edu/sei_blog/2016/09/traffic-analysis-for-network-security-two-approaches-for-going-beyond-network-flow-data.html.
Silva, John Edward. “An Overview of Cryptographic Hash Functions and Their Uses.” SANS Institute, January 15, 2013. https://www.sans.org/reading-room/whitepapers/vpns/overview-cryptographic-hash-functions-879.
Sobers, Rob. “The Definitive Guide to Cryptographic Hash Functions (Part 1).” Varonis Blog, August 2, 2012. https://blog.varonis.com/the-definitive-guide-to-cryptographic-hash-functions-part-1/.
———. “The Definitive Guide to Cryptographic Hash Functions (Part II).” Varonis Blog, August 14, 2012. https://blog.varonis.com/the-definitive-guide-to-cryptographic-hash-functions-part-ii/.
Stallings, William. Cryptography and Network Security: Principles and Practice. 7th edition. Boston, MA: Pearson, 2016.
Vanhoef, Mathy, Célestin Matte, Mathieu Cunche, Leonardo S. Cardoso, and Frank Piessens. “Why MAC Address Randomization Is Not Enough: An Analysis of Wi-Fi Network Discovery Mechanisms.” In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, 413–24. ASIA CCS ’16. New York, NY: ACM, 2016. doi:10.1145/2897845.2897883.
1. Man-in-the-middle attacks are seen as a major security weakness, but there are many situations in which a system is intentionally placed in the flow of an encrypted stream of data. The system in the middle acts as a proxy, unencrypting and reencrypting the data as it passes through the system. Find at least one use case for this kind of system, and explain some of the positive and negative aspects of such a system.
2. As an example of data exhaust, research the idea of a web browser fingerprint. Describe the concept, how accurate it is, and what mitigations are available.
3. A hash and a cryptographic algorithm have many similarities and some differences. Describe these similarities and differences.
4. Find at least one cryptographic system that uses multiple rounds of encryption. Why was this number of rounds chosen? Does the encryption system suggest more rounds for higher security, or not?
5. In recent years, the concept of a public notary has become more difficult to design and fulfill. Describe some of the challenges such a system might face and some of the ways in which these challenges might be overcome.
6. Is MAC address randomization implemented differently with IPv4 and IPv6? What are the differences, and why do they exist?
7. Investigate IPsec. How is it different from TLS? At what layer of the protocol stack does it encrypt, and what modes of operation are available?
Building a single packet processing device has been the focus up to this point; the most common example is the router (or layer 3 switch, now commonly just called a switch, to the confusion of just about everyone). Now it is time to begin connecting routers together. Consider the network in Figure P2-1.
An application running on host A needs to obtain some information from a process running on F. Devices B, C, D, and E are, of course, packet processors (routers). To forward packets between hosts A and F, router B is going to be called on to forward packets to F, even though it is not connected to F; likewise, routers C and D are going to need to forward packets to both A and F, even though they are connected to neither of these hosts.
The question posed in Part II, then, is this:
How do network devices build the tables needed to forward packets along loop-free paths through the network?
The answer is much more complex than it might immediately appear, for there are actually several problems contained within this one:
• How do devices learn about the topology of the network: which links are connected to what devices and destinations?
• How do control planes take this information and build loop-free paths through the network?
• How do control planes detect and react to changes in the network?
• How are control planes scaled to meet the needs of large scale networks?
• What policies are implemented in the control plane, and how?
Each chapter in this part addresses one or more of the sub-problems of the larger question asked in the preceding list. Two chapters are also dedicated to examples of control planes, to show how the problems and solutions have been implemented by widely deployed protocols. The chapters in Part II include:
• Chapter 11: Topology Discovery, which considers how a control plane discovers the network topology and reachability information
• Chapters 12 and 13: Unicast Loop-Free Paths, which consider the problem of calculating a set of loop-free paths through the network, and the widely deployed solutions to this set of problems
• Chapter 14: Reacting to Topology Changes, which considers the options a control plane has to react to a change in the network topology
• Chapter 15: Distance Vector Control Planes, which considers control planes based on Bellman-Ford and the Diffusing Update Algorithm
• Chapter 16: Link State and Path Vector Control Planes, which considers routing protocols based on Dijkstra’s shortest path first algorithm, and routing protocols that keep a list of path elements through which a routing update has passed
• Chapter 17: Policy in the Control Plane, which considers what problems policy needs to solve in the control plane, and a range of solutions for those problems
• Chapter 18: Centralized Control Planes, which considers Software Defined Networks, Programmable Networks, and other control planes that centralize all or some of the policy or the calculation of loop-free paths
• Chapter 19: Failure Domains and Information Hiding, which considers route filtering, aggregation, summarization, and other forms of routing protocol policy
• Chapter 20: Examples of Information Hiding, which considers flooding domain implementation in link state protocols and route aggregation in the Border Gateway Protocol
Network diagrams typically show just a few types of devices, including routers, switches, systems connected to the network (generally speaking, various sorts of hosts), and various sorts of appliances (such as firewalls). These are often interconnected with links, represented as lines. An example is provided in Figure 11-1.
Network diagrams, like many forms of abstraction, hide a lot of information to make the information included more accessible. First, network diagrams tend to be somewhere between logical and physical representations of the network. Such diagrams normally do not show every physical connection in the network; for instance, a network diagram may show a bundle of links as a single link, or a single physical wire that has been multiplexed as several logical links (such as Ethernet, or some other broadcast link, which is a single physical channel used by multiple devices to communicate).
Note
There is often some confusion about the term multiplexing in network engineering. Many engineers tend to think of sharing two virtual links (see Chapter 9, “Network Virtualization”) as the only form of network multiplexing. However, any time there are multiple devices sharing a single link, a situation ultimately requiring some form of addressing, time-based division of traffic, or frequency-based division of traffic, multiplexing is being used. Virtualization can be seen as a second layer of multiplexing, or multiplexing on top of multiplexing.
Second, network diagrams often leave out the logical complexity of services. The control plane, however, cannot mask these sorts of complexities out.
Instead, the control plane must gather information about the network locally and from other control planes, advertise it to other devices running the control plane, and build a set of tables the data plane can use to forward traffic across each device in the network, from source to destination. This chapter is going to consider the problem:
How does the control plane learn about the network?
This question can be broken down into multiple parts:
• What is the control plane trying to learn about? Or perhaps, what are the components of a network topology?
• How does the control plane learn about devices connected to the network?
• What are the basic classifications used in describing the advertisement of information about the network?
The mechanisms used to carry information about the network are not considered in this chapter, as they are typically intimately tied to the way in which the set of loop-free paths is calculated.
The first problem to solve is really a meta-question: what kinds of information does a control plane need to learn and distribute in order to build loop-free paths through a network? A word of warning about the following section, however: Networking terms are difficult to nail down, as individual terms are often used to describe a variety of “things” in the network, depending on the context in which they are used.
A node either processes packets (including forwarding packets), sends packets, or receives packets in a network. The term is taken from graph theory, where they can also be called vertices, although this term is more loosely applied in network engineering. There are several kinds of nodes in a network, including
• Transit node: Any device that is designed to accept packets on one interface, process them in some way, and send them on another interface. Examples of transit nodes are routers and switches; they are often just called nodes, as they will be here, rather than transit nodes.
• Leaf node: Also called an end system or host; any device designed to run applications that generate and/or accept packets from one or more interfaces. These are network sources and sinks; most often these nodes are actually called hosts, rather than leaf nodes, to differentiate them from the shorthand nodes, which typically means a transit node.
There are many readily apparent holes in these two definitions. What should a device be called that accepts a packet on one interface, terminates the connection in a local process or application, generates a new packet, and then transmits that new packet out of a different interface? The problem becomes more difficult if the information contained in the two packets is roughly the same, as in the case of a proxy server, or some other similar device. In these cases, it is useful to classify the device as either a leaf or a node within a specific context, depending on the role it is taking in relation to other devices within the context. To give an example, from the perspective of a host, a proxy server acts as a network forwarding device, as the operation of the proxy server is (somewhat) transparent to the host. From the perspective of an adjacent node, however, proxy servers are hosts, as they terminate traffic streams, and (generally) participate in the control plane the same way a host would.
An edge is any connection between two network devices across which packets are forwarded. The nominal case is a point-to-point link connecting two routers—but this is not the only case. In graph theory, an edge connects precisely two nodes. In network engineering, there are also multipoint, multiplexed, and other kinds of multiaccess links. These are most often modeled as a set of point-to-point links, particularly when building a set of loop-free paths through the network. In network diagrams, however, multiaccess links are often drawn as a single link with multiple nodes attached.
A reachable destination can describe a single host or service, or a set of hosts or services, reachable through the network. The nominal example of a reachable destination is either a host or a set of hosts on a subnet, but it is important to remember the term can also describe a service in some contexts, such as a particular process running on a single device, or many copies of a service available on a number of devices. Figure 11-2 illustrates.
In the network illustrated in Figure 11-2, reachable destinations may include
• Any of the individual hosts, such as A, D, F, G, and H
• Any of the individual nodes, such as B, C, or E
• A service or process running on a single host, such as S2
• A service or process running on multiple hosts, such as S1
• A set of devices attached to a single physical link, or edge, such as F, G, and H
This last reachable destination is also represented as an interface onto a particular link or edge in the network. Hence, router E could have a number of reachable destinations, including
• The interface onto the link connecting router E to C
• The interface onto the link connecting router E to B
• The interface onto the link connecting router E to the hosts F, G, and H
• The network representing reachability to the hosts F, G, and H
• Any number of internal services that might be advertised as individual addresses, ports, or protocol numbers
• Any number of internal addresses attached to virtual links that do not exist in the physical network, but might be used to represent internal state within the device (not shown in Figure 11-2)
The concept of a reachable destination, then, can mean a lot of different things depending on the context. In most networks, a reachable destination is either a single host, a single link (and the hosts attached to the link), or a set of links (and the hosts attached to those links) aggregated into a single reachable destination.
Note
An example of reachable destinations being aggregated is provided in Chapter 5, “Higher Layer Data Transports.” Using a shorter prefix length IP address to represent a set of longer prefix subnets is a form of aggregation.
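The aggregation described in the note can be sketched with Python’s standard ipaddress module; the prefixes below are illustrative documentation addresses:

```python
import ipaddress

# Four contiguous longer-prefix (/26) subnets behind a router.
subnets = [
    ipaddress.ip_network("203.0.113.0/26"),
    ipaddress.ip_network("203.0.113.64/26"),
    ipaddress.ip_network("203.0.113.128/26"),
    ipaddress.ip_network("203.0.113.192/26"),
]

# collapse_addresses merges contiguous networks into the shortest covering
# prefixes, so a single shorter prefix now represents all four subnets.
aggregate = list(ipaddress.collapse_addresses(subnets))
print(aggregate)  # [IPv4Network('203.0.113.0/24')]
```

Advertising only the /24 hides the four /26 subnets from the rest of the network, which is exactly the state reduction aggregation is meant to provide.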
The topology is the set of links (or edges) and nodes that describe the entire network. Normally, the topology is described and drawn as a graph, but it can also be represented in a data structure designed to be consumed by machines, or a tree, which is normally designed to be consumed by humans.
Topological information can be summarized by making destinations that are physically (or virtually) connected several hops away appear to be directly attached to a local node, and then removing the information about those intervening links and nodes from any routing information carried in the control plane beyond the point of summarization. Figure 11-3 illustrates this concept.
It would seem simple enough to learn about the network topology: examine the attached links. What appears simple in networks, however, often turns out to be complex. Examining the local interface can tell you about the link, but not about other network devices attached to the link. Further, even if you can detect another network device running the same control plane on a particular link, this does not mean the other device can detect you. There are, then, several issues to explore.
Given routers A, B, and C are attached to a single link, as illustrated in Figure 11-4, what mechanisms can they use to detect one another, as well as exchange information about their capabilities?
The first point to note about the network shown on the left side of Figure 11-4 is the interfaces do not correspond to neighbors. The actual neighbor relationships are shown on the right side of Figure 11-4. Each router in this network has two neighbors, but only one interface. This illustrates the point that the control plane cannot use interface information to discover neighbors; there must be some other mechanism the control plane can use to find neighbors.
Manual configuration is one widely deployed solution to this problem. Particularly in control planes designed to overlay another control plane, or control planes designed to build neighbor relationships across multiple routed hops through the network, manual configuration is often the easiest mechanism available. From a complexity perspective, manual configuration adds very little to the protocol itself; there is no need for any form of multicast neighbor advertisements, for instance. On the other hand, manual configuration of neighbors does require configuring the neighbor information, which increases complexity from a configuration point of view. In the network in Figure 11-4, router A would need to have neighbor relationships configured with B and C, router B would need to have neighbor relationships configured with A and C, and router C would need to have neighbor relationships configured with A and B. Even if the configuration of neighbors is automated, manual configuration deepens and broadens the interaction surfaces between the management and control planes.
Inferring neighbors from routing advertisements is a solution that was once widespread, but has become less common. In this scheme, each device advertises reachability and/or topology information on a periodic basis. The first time a router receives routing information from some other device, it adds the remote device to a local neighbor table. So long as a neighboring device continues sending routing information on a regular basis, the neighbor relationship will be considered active, or up.
When inferring neighbors from routing advertisements, it is important to be able to determine when a neighbor has failed (so reachability and topology information learned from the neighbor can be removed from any local tables). The most common way to solve this problem is with a pair of timers: the hold or dead timer, and the update or advertisement timer. So long as the neighbor sends an update or advertisement within the dead or hold timer, it is considered up or active. If an entire dead period passes without receiving any updates, the neighbor is considered dead, and some action is taken to either validate the topology and reachability information learned from the neighbor, or it is simply removed from the table.
The normal relationship between the dead and update timers is 3×—the dead timer is set to three times the update timer. Hence, if three consecutive updates or advertisements from a neighbor are missed, the dead timer expires, and processing of the down neighbor begins.
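A minimal sketch of this timer relationship; the 30-second update timer is an illustrative value, not taken from any particular protocol:

```python
import time

UPDATE_TIMER = 30                # seconds between advertisements (illustrative)
DEAD_TIMER = 3 * UPDATE_TIMER    # the 3x relationship described above

def neighbor_is_alive(last_update_seen, now=None):
    """A neighbor is up so long as an update arrived within the dead timer."""
    if now is None:
        now = time.time()
    return (now - last_update_seen) < DEAD_TIMER

# A neighbor last heard from 60 seconds ago has missed two updates but is
# still considered up; one silent for 95 seconds has missed three in a row
# and is processed as down.
assert neighbor_is_alive(last_update_seen=0, now=60)
assert not neighbor_is_alive(last_update_seen=0, now=95)
```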
Explicit hellos are the most common neighbor discovery mechanism. Hello packets are transmitted based on a hello timer, and the neighbor is considered dead if a hello is not received during the interval of a dead or hold timer. This is similar to the dead and update timers used in inferring neighbors from routing advertisements. Hellos typically contain information about the neighboring system, such as capabilities supported, device level identifiers, etc.
Centralized registration is another mechanism sometimes used to discover, and propagate information about, neighboring devices. Each device connecting to the network will send information about itself to some service, and, in turn, learn about other devices connected to the network from this centralized service. This centralized service must somehow be discovered, of course, which is generally accomplished using one of the other mechanisms mentioned.
In control planes with more complex adjacency formation processes—particularly protocols that rely on hellos to form neighbor relationships—it is important to detect if two routers can see one another (communicate bidirectionally) before forming a relationship. Ensuring two-way connectivity not only prevents unidirectional links from creeping into the forwarding table, but it also prevents a constant cycle of neighbor formation—discover a new neighbor, build the correct local tables, advertise reachability to the new neighbor, time out waiting for a hello or some other information, remove the neighbor, then discover the neighbor all over again. There are three broad options in managing two-way connectivity between network devices.
Do not bother checking for two-way connectivity. Some protocols do not try to determine if two-way connectivity exists between network devices in the control plane, but rather assume a neighbor from which packets are being received must also be reachable.
Carry a list of neighbors heard from on the link. For protocols that use hellos to discover neighbors and maintain liveness, carrying a list of reachable neighbors on the same link is a common method to ensure two-way connectivity exists. Figure 11-5 illustrates.
In Figure 11-5, assume router A is powered on before B. In this case:
1. A will send hellos with an empty neighbor list, as it has not heard hellos from any other network device on the link.
2. When B is powered on, it will receive A’s hello, and hence include A in a list of neighbors it has heard in its hello packets.
3. When A receives B’s hello, it will, in turn, include B in its “heard from” neighbor list in its hello packets.
4. When both A and B are reporting one another in their “heard from” neighbor lists, both routers can be certain two-way connectivity has been established.
This process is often called a three-way handshake, based on the three steps:
1. A must send a hello to B, so B can include A in its neighbor list.
2. B must receive A’s hello, and include A in its neighbor list.
3. A must receive B’s hello with itself (A) in B’s neighbor list.
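The three steps above can be sketched as follows; the hello format (a name plus a "heard from" list) is a simplification of what real protocols carry:

```python
class HelloNode:
    """Minimal model of hello-based two-way connectivity checking."""

    def __init__(self, name):
        self.name = name
        self.heard = set()      # neighbors whose hellos we have received
        self.two_way = set()    # neighbors with confirmed two-way connectivity

    def make_hello(self):
        # A hello carries the sender's name and its "heard from" list.
        return (self.name, frozenset(self.heard))

    def receive_hello(self, hello):
        sender, senders_heard = hello
        self.heard.add(sender)
        # Seeing our own name in the sender's list proves the sender is
        # receiving our hellos, so connectivity is two-way.
        if self.name in senders_heard:
            self.two_way.add(sender)

a, b = HelloNode("A"), HelloNode("B")
b.receive_hello(a.make_hello())   # steps 1-2: B hears A, lists A
a.receive_hello(b.make_hello())   # step 3: A sees itself in B's list
b.receive_hello(a.make_hello())   # B, in turn, sees itself in A's list
print(a.two_way, b.two_way)       # {'B'} {'A'}
```

Note that neither side declares two-way connectivity until it sees its own name reflected back, which is exactly the property the three-way handshake guarantees.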
Rely on an underlying transport protocol. Finally, control planes can rely on an underlying transport mechanism to ensure two-way connectivity exists. This is an uncommon solution, but there are some widely deployed solutions. For instance, the Border Gateway Protocol (BGP), explained in Chapter 16, “Link State and Path Vector Control Planes,” relies on the Transmission Control Protocol (TCP), considered in Chapter 5, “Higher Layer Data Transports,” to ensure two-way connectivity between BGP speakers.
It is often useful for a control plane to move beyond just checking for two-way connectivity. Many control planes also check to make certain the Maximum Transmission Unit (MTU) on both interfaces onto the link are configured with the same MTU. Figure 11-6 illustrates the problem being solved with a link-level MTU check in the control plane.
In a situation where the MTU is mismatched between two interfaces on the same link, it is possible for a neighbor relationship to form but routing and other information to fail to be carried between the network devices. While many protocols have some mechanism to prevent information about the resulting unidirectional links from being used in calculating loop-free paths through the network, it is still useful to detect this situation so it can be explicitly reported and repaired. Several techniques are commonly used by control plane protocols to either explicitly detect this condition, or to at least prevent the initial stages of neighbor formation from taking place.
The control plane protocol can include the locally configured MTU in a field in the hello packets. Rather than just checking for the existence of a neighbor during the three-way handshake, each router can also check to make certain the MTU on both ends of the link match before adding a newly detected network device as a neighbor.
Another option is to pad the hello packets to the MTU of the local interface. If the padded, maximum-sized hello packet is not received by some other device on the link, the initial stages of the neighbor relationship will not complete. The three-way handshake cannot be completed if both devices are not receiving one another’s hello packets.
Finally, the control plane protocol can rely on an underlying transport to regulate packet sizes so the communicating devices can receive them. This mechanism is primarily used in control planes designed to overlay some other control plane, particularly in the case of interdomain routing and network virtualization. Overlay control planes often rely on Path MTU (PMTU) discovery to provide an accurate MTU between two devices connected through multiple hops.
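The first two of these checks might be sketched as follows; the hello fields and MTU values are illustrative, not drawn from any specific protocol:

```python
def make_hello(router_id, mtu):
    # First option: carry the locally configured MTU as a field in the hello.
    return {"router": router_id, "mtu": mtu}

def accept_neighbor(local_mtu, hello):
    # Refuse the adjacency when the two ends disagree on MTU, so the
    # mismatch is reported before routing information is exchanged.
    return hello["mtu"] == local_mtu

def padded_hello(router_id, mtu):
    # Second option: pad the hello to the full local MTU; if the far end's
    # MTU is smaller, the padded hello is never received, and the
    # three-way handshake cannot complete.
    body = router_id.encode()
    return body + b"\x00" * (mtu - len(body))

assert accept_neighbor(1500, make_hello("B", 1500))
assert not accept_neighbor(1500, make_hello("C", 9000))
assert len(padded_hello("A", 1500)) == 1500
```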
The MTU size itself can have a large impact on the performance of a control plane in terms of its speed of convergence. For instance, assume a protocol must send information describing 500,000 destinations over a multihop link with 500ms of delay, and each destination requires 512 bits to describe:
• If the MTU is less than 1,000 bits, the control plane will require 500,000 round trips to exchange the entire database of reachable destinations, or around 500,000 × 500ms, which is 250,000 seconds, or close to 70 hours.
• If the MTU is 1,500 octets, or 12,000 bits, the control plane will require around 21,000 round trips to describe the entire database of reachable destinations, or around 21,000 × 500ms, which is around 175 minutes.
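These back-of-the-envelope figures can be reproduced with a short sketch (the small difference from the ~175 minutes quoted above is rounding):

```python
import math

def full_exchange(destinations, bits_per_dest, mtu_bits, rtt_seconds):
    """Round trips and total time needed to move the whole database,
    assuming one MTU-sized packet per round trip."""
    dests_per_packet = max(1, mtu_bits // bits_per_dest)
    round_trips = math.ceil(destinations / dests_per_packet)
    return round_trips, round_trips * rtt_seconds

trips_small, secs_small = full_exchange(500_000, 512, 1_000, 0.5)
trips_large, secs_large = full_exchange(500_000, 512, 12_000, 0.5)

print(secs_small / 3600)   # ~69 hours with a sub-1,000-bit MTU
print(secs_large / 60)     # ~181 minutes with a 1,500-octet MTU
```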
The importance of compressing such a database, using some sort of windowing mechanism to reduce the number of full round trips required to exchange the reachability information, and increasing the MTU is readily apparent.
Neighbor discovery allows the control plane to learn about the topology of the network, but how is information about reachable destinations learned? In Figure 11-7, how does router D learn about hosts A, B, and C?
There are two broad classes of solutions to this problem—reactive and proactive—discussed in the following sections.
In Figure 11-7, assume host A has just been powered on, and the network is only using dynamic learning based on transmitted data traffic. How can router D learn about this newly attached host? One possibility is for A to simply start sending packets. For instance, if A is manually configured to send all packets toward destinations it does not know how to reach (essentially, anything that is off segment, a concept considered in Chapter 6, “Interlayer Discovery”) to D, A has to send at least one packet for D to discover its existence. On learning of A, D can cache any relevant information for some time—generally for as long as A appears to be sending traffic. If A does not send traffic for some time, D can time out the entry for A in its local cache.
This process of discovering reachability based on actual traffic flow is reactive discovery. From a complexity perspective, reactive discovery trades optimal traffic flow against the information known about, and potentially carried, in the control plane.
It will take some amount of time for reactive discovery mechanisms to operate—that is, for D to learn about the existence of A once the host starts sending packets. For instance, if host F begins sending traffic toward A the moment A is powered on, traffic may be forwarded through the network to D, but D will not have the information required to forward the traffic onto the link, and hence to A. During the time between host A being powered on and D discovering its existence, packets will be dropped—a situation that will appear, to F, to be a network failure at worst, and some additional jitter (or perhaps an unpredictable response across the network) at best.
Cached entries will need to be timed out over time. This will normally require balancing a number of factors, including how large the cache is, how much device information is cached, and how often the cache entry has been used in some past time period.
The length of time it takes to age out this cached information presents a security risk, because stale information can serve as the foundation for an attack. For instance, if A moves its connection from D to E, the information D has learned about A will remain in D’s cache for some time. During this time, if another device connects to the network at D, it can impersonate A. The longer cached information remains valid, the easier it is to execute this type of attack.
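A minimal sketch of such a reactive cache with timed-out entries; the TTL and host names are illustrative, and real implementations weigh cache size, entry usage, and the attack window described above when choosing the timeout:

```python
class ReactiveCache:
    """Reactively learned host entries, aged out after a fixed TTL."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}   # host -> time the entry was last refreshed

    def learn(self, host, now):
        # Called whenever traffic from the host is seen; refreshes the entry.
        self.entries[host] = now

    def lookup(self, host, now):
        learned = self.entries.get(host)
        if learned is None or now - learned >= self.ttl:
            # Missing or stale: drop the entry; the host must be
            # rediscovered reactively from its next packet.
            self.entries.pop(host, None)
            return False
        return True

cache = ReactiveCache(ttl=300.0)
cache.learn("A", now=0.0)
print(cache.lookup("A", now=100.0))   # True: entry still fresh
print(cache.lookup("A", now=400.0))   # False: timed out and removed
```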
Some reachability information can be learned proactively, which means the router does not need to wait for an attached host to start sending traffic to learn about it. This capability tends to be important in environments where hosts can be highly mobile; for instance, in a data center fabric where virtual machines may move between physical devices while keeping their address or other identifying information, or in networks that support wireless devices, such as mobile phones. There are four widely used ways to learn reachability information proactively, covered here:
• A neighbor discovery protocol can be run between the edge networking nodes (or devices) and connected hosts. The information learned from such a neighbor discovery protocol can then be used to inject reachability information in the control plane. While neighbor discovery protocols are widely deployed, the information learned through these protocols is not widely used to inject reachability information into the control plane.
• Reachability information can be learned through device configuration. Almost all network devices (such as routers) will have a reachable address configured or discovered on all host-facing interfaces. Network devices can then advertise these attached interfaces as reachable destinations. In this situation, the link (or wire), the network, or the subnet is the reachable destination, rather than individual hosts. This is the most common way for routers to learn network layer reachability information.
• Hosts can register with an identity service. In some systems, a service (whether centralized or distributed) keeps track of where hosts are attached, including such information as the first hop router through which traffic should be sent to reach them, name to address mapping, services each host is capable of providing, services each host is searching for and/or using, and other information. Identity services are common, although they are not often highly visible to network engineers. Such systems are very common in high mobility environments, such as consumer-facing wireless networks.
• The control plane can pull information from an address management system, if one is deployed throughout the network. This is a very uncommon solution, however. Most of the interaction between the control plane and address management systems would be through local device configuration; the address management system assigns an address to an interface, and the control plane picks up this interface configuration to be advertised as a reachable destination.
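The second mechanism, deriving reachable destinations from interface configuration, can be sketched as follows; the interface names and addresses are illustrative:

```python
import ipaddress

# Host-facing interface configuration on a hypothetical router; the router
# advertises each connected subnet, not the individual hosts on it.
interfaces = {
    "eth0": "192.0.2.1/24",
    "eth1": "198.51.100.1/26",
}

def connected_destinations(interfaces):
    """Derive advertisable destinations from the configured interfaces."""
    return sorted(
        str(ipaddress.ip_interface(address).network)
        for address in interfaces.values()
    )

print(connected_destinations(interfaces))
# ['192.0.2.0/24', '198.51.100.0/26']
```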
Once topology and reachability information are learned, the control plane must distribute this information through the network. While the method used to advertise this information is somewhat dependent on the mechanism used to calculate loop-free paths (as which information is required where to calculate loop-free paths will vary depending on how these paths are calculated), there are some common problems and solutions that will apply to every possible system. The primary problems are deciding when to advertise reachability and reliably transporting information through the network.
When should the control plane advertise topology and reachability information? The obvious answer might be “when it is learned”—but the obvious answer is often the wrong answer. Determining when to advertise information actually involves a careful balance between optimal network performance and managing the amount of control plane state. Figure 11-8 will be used to illustrate.
Assume hosts A and F are sending data to one another almost constantly, but B, G, and H do not send traffic at all for some extended period. Two obvious questions arise in this situation:
• While it might make sense for router C to maintain reachability information about B, why should D and E maintain this information?
• Why should router E maintain reachability information about host A?
From a complexity perspective, there is a direct tradeoff between the amount of information carried and held in the control plane and the ability of the network to accept and forward traffic quickly. Considering the first question, for instance, the tradeoff appears as C’s ability to send traffic from B to G on receiving it versus C maintaining less information in its forwarding tables, at the cost of needing to obtain, through some mechanism, the information required to forward packets when they arrive. There are three broad solutions to this problem.
• A Proactive Control Plane: The control plane can proactively discover the topology, calculate a set of loop-free paths through the network, and advertise reachability information.
• Proactive Topology Discovery with Reactive Reachability: The control plane can proactively discover the topology and calculate a set of loop-free paths. However, the control plane can wait until reachability information is needed to forward packets before discovering and/or advertising reachability.
• A Reactive Control Plane: The control plane can reactively discover the topology, calculate a set of loop-free paths through the network (generally on a per destination basis), and advertise reachability information.
If C learns, keeps, and distributes reachability information proactively, or this network is running a proactive control plane, then new flows of traffic can be forwarded through the network without any delays. If the devices illustrated are running a reactive control plane, C would
• Wait until the first packet in the flow toward G (for instance)
• Discover the path to G using some mechanism
• Install the path locally
• Begin forwarding traffic toward G
The same process would need to be performed at D for traffic being forwarded toward A from G and F (remember flows are almost always bidirectional). During the time the control plane is learning a path to the destination, traffic is (almost always) being dropped, because the network devices do not have any forwarding information for this reachable destination (from the network device’s perspective, the reachable destination does not exist). The time required to discover and build the correct forwarding information may fall between a few hundred milliseconds and a few seconds; during this time, the host and applications will not know whether connectivity will eventually be established, or if the destination is simply unreachable.
Control planes can be broadly classified into
• Proactive systems advertise reachability information throughout the network before it is needed. Another way to phrase this is to say proactive control planes keep reachability information for every destination installed at every network device, regardless of whether the information is being used or not. Proactive systems increase the amount of state carried and stored in the control plane to make the network more transparent to hosts, or rather more optimal for short-lived and time-sensitive flows.
• Reactive systems wait until forwarding information is needed to obtain it, or rather they react to events in the data plane to build control plane information. Reactive systems decrease the amount of state carried in the control plane at the cost of making the network less responsive to applications and less optimal for short-lived or time-sensitive flows.
As with all tradeoffs in network engineering, the two options described here are not exclusive. It is possible to implement a control plane that contains some proactive, and some reactive, elements. For instance, it is possible to build a control plane that carries a minimal amount of reachability information describing rather suboptimal paths through the network, but that can discover more optimal paths if a longer-lived or quality-of-service-sensitive flow is detected.
Returning to Figure 11-8 as a reference, assume a reactive control plane has been deployed, and B would like to start exchanging data flows with G. How can C develop the forwarding information required to correctly switch this traffic?
The router can send a query through the network or send a query to a controller to discover a path to the destination. For instance:
• When B first connects to the network, and C learns about this newly attached host, C could send information about B as a reachable destination to a controller attached to the network.
• In the same way, when G connects to the network, and D learns about this newly attached host, D could send information about G as a reachable destination to a controller attached to the network.
Because the controller learns about every host (or reachable destination) attached to the network (and, in some systems, the entire topology of the network, as well), when C needs to learn how to reach host G, the router can query the controller, which can provide this information.
Note
The concept of a centralized controller implies a single controller providing information for the entire network, but this is not how the term centralized control plane is commonly used throughout the network engineering world. The idea of centralization, however, is rather loose in network engineering. Rather than indicating a single device, centralized is generally used to mean not carried hop by hop through the network, and not computed by each network device independently. See Chapter 18, “Centralized Control Planes,” for more information.
The router (or host) can send an explorer packet that records the route from the source to the destination and report this information to the source of the explorer, which is then used as a source route. Figure 11-9 illustrates.
Using Figure 11-9, and assuming host-based source routing:
1. Host A needs to send a packet to H but does not have a path.
2. A sends an explorer to its default gateway, router C.
3. C does not have a route to the destination, so it forwards the explorer packet onto all links other than the one it received the packet on; hence to B, D, and E.
4. B is a host, has no further interfaces, and is not the target of the explorer, so it ignores the explorer packet.
5. Neither D nor E has a path to H, so they both forward the explorer onto all interfaces except the one they received the packet on; hence onto the multi-access link shared between themselves and F.
6. F receives two copies of the same explorer packet; it chooses one based on some local criteria (such as the first received, or some control plane policy) and forwards it onto all the interfaces on which it did not receive the packet, toward G.
7. G receives the packet and, given it does not have a path to reach H, forwards it onto the only other link it has, which leads to H.
8. H receives the explorer and responds.
In this scheme, each device along the path adds itself to a list of traversed nodes before forwarding the explorer packet to all interfaces except the one on which it was received. In this way, when H receives the explorer packet (which is ultimately directed at finding a path to H), the packet now describes a complete path from A to H. When H replies to the explorer, it places this path into the body of the packet; when A receives the response, it will now have a complete path from A to H.
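The flooding-and-recording process in these steps can be sketched in a few lines of Python. This is a hypothetical model, not any particular protocol's implementation; the adjacency table is an assumption based on Figure 11-9:

```python
from collections import deque

# Assumed adjacency table modeled on Figure 11-9.
TOPOLOGY = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B", "D", "E"],
    "D": ["C", "F"],
    "E": ["C", "F"],
    "F": ["D", "E", "G"],
    "G": ["F", "H"],
    "H": ["G"],
}

def flood_explorer(source, target):
    """Flood an explorer; each node appends itself to the recorded path.

    A node forwards only the first copy it receives (as F does in step 6)
    and never forwards back onto the link the packet arrived on."""
    queue = deque([[source]])
    forwarded = set()
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path          # the explorer now records the full path
        if node in forwarded:
            continue             # duplicate copy: discard it
        forwarded.add(node)
        previous = path[-2] if len(path) > 1 else None
        for neighbor in TOPOLOGY[node]:
            if neighbor != previous:
                queue.append(path + [neighbor])
    return None
```

Running `flood_explorer("A", "H")` against this topology returns the path `["A", "C", "D", "F", "G", "H"]` described in the text.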
Note
In some implementations, A would neither generate nor receive the response to the explorer packet; rather, C, the first hop router, could perform these functions. In the same way, H itself may not respond to these explorer packets; rather, G, or any other network device along the path that has information about how to reach H, may respond. The general concept and processing remain the same in these cases, however.
To send packets to H, then, A inserts this path into the packet header in the form of a source route containing the path [A,C,D,F,G,H]. When each router receives this packet, it will examine the source route in the header to determine which router to forward the traffic to next. For instance, C will examine the source route information in the packet header and determine the packet needs to be sent to D next, while D will examine this information and determine it needs to send the packet to F.
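Forwarding against the source route is then a simple lookup of the current device's successor in the list; a minimal sketch:

```python
def next_hop(source_route, current):
    """Return the node the current device should forward to next, based on
    the source route carried in the packet header."""
    position = source_route.index(current)
    if position == len(source_route) - 1:
        return None              # current device is the final destination
    return source_route[position + 1]

route = ["A", "C", "D", "F", "G", "H"]
# C forwards to D, and D forwards to F, exactly as described above.
```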
Note
In some implementations, every explorer is actually sent to the destination, which then determines which path traffic should take. There are, in fact, a number of different ways to implement source routing; the process given here is just one example to explain the general idea of source routing.
Proactive control planes, in contrast to reactive control planes, distribute reachability and topology information throughout the network when the information becomes available, rather than when it is needed to forward packets. The primary challenge proactive control planes face is in ensuring that reachability and topology information is carried reliably between the nodes in the network, resulting in every device having the same reachability information.
Note
This is really a distributed database problem; Chapter 14, “Reacting to Topology Changes,” considers the distribution of reachability and topology within the context of a database in more detail.
Dropping control plane information can result in permanent routing loops or create routing black holes (so called because they consume traffic transmitted to destinations with no trace), both of which seriously reduce the usefulness of the network for applications (probably an understatement). There are several widely used mechanisms to ensure the reliable transportation of control plane information through a network.
A control plane can transmit information periodically, timing out older information. This is similar to neighbor formation, in that each router in the network will transmit the reachability information it has to all neighbors (or on all interfaces, depending on the control plane), based on a timer, usually called an update or advertisement timer. Reachability information, once received, is held in a local table and timed out over some time period, often called the hold timer (again, just like a neighbor discovery hello).
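The interaction between the update timer and the hold timer can be modeled with a small table keyed by refresh time. This is a sketch with hypothetical names and timer values, not tied to any specific protocol:

```python
class ReachabilityTable:
    """Entries are refreshed by periodic advertisements and expire once
    the hold timer passes without hearing a refresh."""

    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.refreshed = {}          # destination -> time of last advertisement

    def advertise(self, destination, now):
        self.refreshed[destination] = now

    def valid_destinations(self, now):
        return [dest for dest, then in self.refreshed.items()
                if now - then < self.hold_time]
```

With a hold time of 180 seconds, for instance, a destination advertised at t=0 remains valid at t=100, but is timed out of the table at t=200 unless a fresh advertisement arrives first.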
The remaining mechanisms described here rely on an existing neighbor discovery system to ensure the reliable delivery—and continued reliability—of reachability information. In all of these systems:
• The list of neighbors is used not only to drive the transmission of new reachability information, but also to verify the correct receipt of reachability information.
• So long as a neighbor is active, or alive, reachability information received from that neighbor is assumed to remain valid.
Within the context of neighbor-based reachability distribution, there are several commonly used mechanisms to make certain reachability information is carried device to device; often any given control plane will deploy more than one of the techniques described here.
The control plane can use sequence numbers (or some other mechanism) to ensure correct replication. Sequence numbers can actually be used to describe individual packets and large blocks of reachability information; Figure 11-10 illustrates.
On receiving a packet, the receiver can send an acknowledgment of the receipt of the packet by noting the sequence numbers it has received. A separate sequence number can be used to describe individual Network Layer Reachability Information (NLRI) as it is carried through the network. NLRI information spread out over several packets can then be described using a single sequence number.
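One way to picture this acknowledgment process is a sender that holds each advertisement until the sequence number it was transmitted with has been acknowledged. A hypothetical sketch (the names here are invented for illustration):

```python
class ReliableSender:
    """Hold advertised NLRI until the receiver acknowledges the sequence
    number it was transmitted with; anything unacknowledged remains a
    candidate for retransmission."""

    def __init__(self):
        self.next_sequence = 1
        self.unacknowledged = {}     # sequence number -> NLRI payload

    def send(self, nlri):
        sequence = self.next_sequence
        self.next_sequence += 1
        self.unacknowledged[sequence] = nlri
        return sequence              # carried in the packet with the NLRI

    def acknowledge(self, sequence):
        self.unacknowledged.pop(sequence, None)

    def pending(self):
        return sorted(self.unacknowledged)
```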
The control plane can describe the database to ensure correct replication. For instance, a control plane could describe the information in the database as
• A list of sequence numbers matching individual entries containing reachability information contained in the database
• Groups of contiguous sequence numbers contained in the database (a somewhat more compact way to represent all the sequence numbers)
• A set of sequence numbers paired with hashes of the information within each reachability information entry; this has the advantage of not only describing the entries in the database, but also of providing a way for the receiver to verify the contents of each entry, yet without carrying the entire database to perform the check
• A hash across blocks of reachability entries contained in the database, which can be calculated across the same entries by the receiver and directly compared to determine if entries are missing
These kinds of database descriptors can be transmitted periodically, only when there are changes, or in other specific situations, not only to ensure the network devices have synchronized databases, but also to determine what is missing or in error, so the missing information can be requested.
Each of these schemes has advantages and disadvantages; generally, protocols will implement a scheme that allows an implementation to check not only for missing information, but also for information that has been inadvertently corrupted, either in memory or during transmission.
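As an illustration of the hash-based descriptors, the following sketch pairs each entry's sequence number with a hash of its contents, so a receiver can detect entries that are missing or corrupted. This is a hypothetical encoding, not any protocol's wire format:

```python
import hashlib

def entry_digest(sequence, nlri):
    """Hash one reachability entry so its contents can be verified without
    carrying the entry itself."""
    return hashlib.sha256(f"{sequence}:{nlri}".encode()).hexdigest()

def describe(database):
    """Summarize a database as {sequence number: hash of entry contents}."""
    return {seq: entry_digest(seq, nlri) for seq, nlri in database.items()}

def missing_or_corrupt(local, remote_description):
    """Entries the neighbor describes that are absent locally, or whose
    local contents hash differently (indicating corruption)."""
    local_description = describe(local)
    return sorted(seq for seq, digest in remote_description.items()
                  if local_description.get(seq) != digest)
```

Comparing descriptions rather than full databases keeps the exchange small; only the entries flagged by the comparison need to be re-requested.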
There are many instances where it is more effective, or in line with specific policy restrictions, for a control plane to learn reachability and topology information from another control plane, rather than through the mechanisms outlined up to this point in this chapter. Some examples might be as follows:
• Two organizations need to interconnect their networks, but neither wants to allow the other to control the policies and operation of their control planes
• A large organization is made up of many business units, each of which is allowed to run its own internal network based on local conditions and application requirements.
• An organization needs some way to allow two control planes to interoperate while transitioning from one to the other.
The reasons for allowing one control plane to learn reachability information from another are almost boundless. Given the requirement, many network devices allow operators to redistribute information between control planes. Redistributing reachability raises two control plane–related problems: how to handle metrics and how to prevent routing loops.
Note
Redistribution can be seen as exporting routes out of one protocol and into another. In fact, import/export and redistribution are often used to mean the same thing, either by different vendors, or even in different situations by the same vendor.
The relationship between link properties, policies, and metrics is defined by each control plane protocol independently of other protocols; in fact, a more descriptive, or otherwise more useful, metric system is sometimes what attracts operators to a specific control plane protocol. Figure 11-11 illustrates two sections of a network running two different control planes, each of which uses a different method to calculate link metrics.
Protocols X and Y, in this network, have been configured using two different systems for assigning metrics. In deploying protocol X, the administrator divided 1,000 by the link speed in gigabits. In deploying protocol Y, the administrator set up a “table of metrics,” based on a best guess at the highest and lowest speed links they might have for the next 10 to 15 years, and assigned metrics to different link speeds within this table. The result, as the illustration shows, is incompatible metrics:
• 10G links in protocol X have a metric of 100, while in protocol Y they have a metric of 20.
• 100G links in both protocol X and Y have a metric of 10.
Assuming the lower metric is preferred, if the metrics are added, the [B,C,F] link would be considered a more desirable path than the [B,D,G] link. If the bandwidth is considered, however, both links would be considered equally desirable.
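The incompatibility is easy to see by computing both administrators' metrics side by side. The two assignment rules below come straight from the example; everything else is a sketch:

```python
def metric_x(gigabits):
    """Protocol X: 1,000 divided by the link speed in gigabits."""
    return 1000 // gigabits

# Protocol Y: the administrator's hand-built table of metrics by link speed.
METRIC_Y = {10: 20, 100: 10}

# A 10G link carries a metric of 100 in X but only 20 in Y, while 100G
# links happen to agree at 10 -- so summing metrics across a
# redistribution boundary compares incompatible quantities.
```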
If redistribution is configured between these two protocols, how should these metrics be handled? There are three common solutions to this problem.
The administrator can assign a metric at each redistribution point, which is carried as part of the internal protocol metric. For instance, the administrator might assign a metric of 5 to the destination E at router C when redistributing from protocol X into Y. This destination, E, is injected into protocol Y with a metric of 5 by router C. At router F, the metric to E would be 25 through C. At G, the cost to reach E would be 35, along the path [F,C]. The desirability of using any particular exit point for any specific destination is chosen by the operator when these manual metrics are assigned.
The metric of the “other” protocol can be accepted as part of the internal protocol metric. This does not work where one protocol has a wider range of available metrics than the other. For instance, if protocol Y has a maximum metric of 63, the 10G metrics from protocol X will be “above maximum,” a situation that is not likely to be optimal. Assuming no such restriction, router C would inject a route to E with a cost of 100 into protocol Y. The cost to reach E at router F would be 110; the cost at G would be 130 through [F,C].
Note
You might recognize a tradeoff between control plane state and optimal use of the network here, another instance of the complexity tradeoffs in real-world protocol design. Carrying the external metric in a separate field adds control plane state, but allows more optimal steering of traffic through the network. Assigning or consuming the external metric reduces control plane state, but at the cost of the ability to optimize traffic flow.
The external metric can be carried as a separate field, so each network device can make a separate determination about the best path to each external destination. This third solution is the most widely used, as it provides the best ability to steer traffic between the two networks. In this solution, C injects reachability to E with an external cost of 100. At F, there are two metrics in the advertisement describing reachability to E; the internal metric to reach the redistribution (or exit) point is 20, and the metric to reach E within the external network is 100. At G, the internal metric to reach the exit point is 30, and the external metric is 100.
How would an implementation use both of these metrics? Should the protocol choose the closest exit point, or rather the lowest internal metric? This would optimize the local network usage, and potentially deoptimize the usage of network resources in the external network. Should the protocol choose the exit point closest to the external destination, or rather the lowest external metric? This would optimize network resources in the external network, potentially at the cost of deoptimizing the use of network resources in the local network. Or should the protocol try to combine these two metrics in some way, to optimize the use of resources in both networks as much as possible?
Some protocols choose to always optimize local or external resources, while others will provide operators with a configuration option. For instance, a protocol may allow external metrics to be carried as different types of metrics, where one type is considered larger than any internal metric (hence preferring the lowest internal metric first, and using the external metric as a tie breaker), and the other type is where the internal and external metrics are considered equivalent (hence adding the internal and external metrics to make a path decision).
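The two metric types described here can be sketched as a pair of comparison keys. The type numbering below is an assumption for illustration, not any protocol's terminology:

```python
def route_key(internal, external, metric_type):
    """Build a sortable key for a redistributed route.

    Type 1: internal and external metrics are equivalent, so add them.
    Type 2: any external metric outweighs any internal one, so compare the
            external metric first and break ties on the internal metric."""
    if metric_type == 1:
        return (internal + external,)
    return (external, internal)      # tuple comparison: external dominates

# Candidate exits: internal 30/external 90 versus internal 20/external 100.
# Type 2 prefers the first (90 < 100); type 1 sees a tie (both sum to 120).
```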
In the discussion above, you might have noticed that destinations redistributed from one protocol to another always appear as if they are connected to the redistributing router. In essence, redistribution acts as a form of summarization (which means topology information is removed, rather than reachability information), as described earlier in this chapter. While this point isn’t crucial to redistribution metrics, it is important when considering the ability of the control plane to choose the optimal path. In some specific cases, deoptimization can lead to a complete failure of the control plane to choose loop-free paths; Figure 11-12 illustrates.
To build the routing loop in this network:
1. The route to host A is redistributed from protocol X to Y with a manually configured metric of 1.
2. Router E prefers the route through C with a total metric (internal and external) of 2.
3. Router D prefers the route through E with a total metric of 3.
4. Router D redistributes the route to host A into protocol X with the existing metric of 3.
5. Router B has two routes to A: one with a cost of 10 (directly) and one with a metric of 4 through D.
6. Router B chooses the path through D, creating a routing loop.
7. And so on (the loop will continue until each protocol reaches its maximum metric).
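Steps 5 and 6 are the heart of the failure: at B, the looped path simply looks cheaper. A sketch of B's decision, using the values from the steps above (the extra hop cost of 1 on the B–D link is an assumption made to match the metric of 4 in step 5):

```python
# B's two candidate routes to host A, as in steps 5 and 6.
routes_at_b = {
    "direct": 10,        # the legitimate route to A
    "via D": 3 + 1,      # D's redistributed metric of 3, plus the B-D hop
}

def best_exit(routes):
    """Pick the lowest-metric candidate, as step 6 assumes."""
    return min(routes, key=routes.get)

# best_exit(routes_at_b) selects "via D", completing the routing loop.
```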
This example is a little stretched to create a routing loop in a trivial network, but all routing loops caused by redistribution are similar in their structure. It is important, in this example, that not only has topology information been lost (the route to A has been summarized, appearing, from E’s perspective, to be directly attached to C), but metric information has been lost as well (the original route, with a cost of 11, is redistributed into protocol Y with a cost of 1 at C). There are a number of common mechanisms used to prevent this routing loop from forming.
The routing protocol can always prefer internal over external routes. In this case, if B always prefers the internal route to A over the external path through D, the routing loop cannot form. Many routing protocols will use an ordering preference when installing routes into the local routing table (or Routing Information Base, RIB), to always prefer internal routes over external ones. The reason for this preference is to prevent routing loops of this type from forming.
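This preference ordering can be sketched as route selection on a (route class, metric) key, where the class dominates. The administrative preference values here are invented for illustration:

```python
# Assumed preference values: lower wins, and internal always beats external.
CLASS_PREFERENCE = {"internal": 0, "external": 1}

def install_best(candidates):
    """Prefer internal routes over external ones unconditionally; the
    metric only breaks ties within a class."""
    return min(candidates,
               key=lambda route: (CLASS_PREFERENCE[route["class"]],
                                  route["metric"]))

# B's choice from the loop example: the external path through D has the
# lower metric, but the internal route still wins, so no loop forms.
candidates = [
    {"class": "internal", "metric": 10, "next_hop": "A"},
    {"class": "external", "metric": 4, "next_hop": "D"},
]
```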
Filters could be configured to prevent individual destinations from being redistributed twice. In this network, router D could be configured to prevent any external route received in protocol Y from being redistributed into protocol X. In a situation where there are only two protocols (or networks) with control plane information redistributed between them, this can be a simple solution. In cases where the filters need to be configured for each destination, the filters can quickly become difficult to manage. Mistakes in configuring these filters can either cause some destinations to become unreachable (routing black holes), or permit a loop to form, potentially causing a failure in the control plane.
Routes can be tagged when they are redistributed, and then filtered based on these tags at other redistribution points. For instance, when the route to A is redistributed into protocol Y at C, the route could be administratively tagged with some number, such as 100, so the route can be easily identified. At router D, a filter could be configured to block any route marked with the tag 100, preventing the routing loop from forming. Many protocols allow a route to carry an administrative tag (sometimes called a community, or some other similar name), and then to filter routes based on this tag.
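A minimal sketch of tag-and-filter at the two redistribution points follows. The tag value 100 comes from the example; the route representation is invented:

```python
LOOP_PREVENTION_TAG = 100    # the tag value chosen by the operator

def redistribute_at_c(route):
    """At C: attach the tag as the route is injected into protocol Y."""
    tagged = dict(route)
    tagged["tag"] = LOOP_PREVENTION_TAG
    return tagged

def permitted_at_d(route):
    """At D: refuse to redistribute anything already carrying the tag,
    which prevents the loop from forming."""
    return route.get("tag") != LOOP_PREVENTION_TAG

route_to_a = {"prefix": "A", "metric": 1}
# permitted_at_d(redistribute_at_c(route_to_a)) is False: D filters it out.
```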
This chapter covered a lot of ground, mostly in the process of considering a wide array of problems that control planes face in some fundamental areas. For each of these problems, a range of solutions was offered, many of which are implemented by real control plane protocols used in running networks throughout the world.
Discovering the topology on a per link basis was the first problem considered, including detecting other network devices, determining if two-way connectivity exists between devices, and determining the MTU (and whether or not it matches). Learning about reachable destinations was the second problem considered. Two broad classes of solutions were considered here: reactive and proactive. Advertising reachability information was divided into the same two broad classes, reactive and proactive, but reliable transmission of information through the network was also considered in some detail. Finally, redistribution between routing protocols was considered, as this is a common way for a control plane to learn about reachable destinations in an indirect way.
You will meet these problems, and their solutions, again in considering actual protocol implementations in Chapter 15, “Distance Vector Control Planes,” and Chapter 16, “Link State and Path Vector Control Planes,” which consider distributed and centralized control plane implementations in more detail. Each of these problems and their solutions are fundamental to the operation of successful control plane protocols in the real world.
Alekseev, V. B., V. P. Kozyrev, and A. A. Sapozhenko. “Graph Theory,” February 2011. https://www.encyclopediaofmath.org/index.php/Graph_theory.
Caldwell, Chris K. “Graph Theory Tutorials,” 1995. http://primes.utm.edu/graph/.
Doyle, Jeff, and Jennifer DeHaven Carroll. Routing TCP/IP, Volume 1. 2nd edition. New Delhi, India: Cisco Press, 2005.
“Enhanced Interior Gateway Routing Protocol.” Cisco. Accessed September 4, 2017. https://www.cisco.com/c/en/us/support/docs/ip/enhanced-interior-gateway-routing-protocol-eigrp/16406-eigrp-toc.html.
Huang, Peng, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems.” In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, 150–55. HotOS ’17. New York, NY, USA: ACM, 2017. doi:10.1145/3102980.3103005.
Krebs, Valdis. “The Social Life of Routers.” Internet Protocol Journal, December 2000. http://www.orgnet.com/SocialLifeOfRouters.pdf.
Lahey, Kevin. TCP Problems with Path MTU Discovery. Request for Comments 2923. RFC Editor, 2000. doi:10.17487/RFC2923.
Mathis, Matt, and John Heffner. Packetization Layer Path MTU Discovery. Request for Comments 4821. RFC Editor, 2007. doi:10.17487/RFC4821.
McCann, Jack, Stephen E. Deering, Jeffrey Mogul, and Robert M. Hinden. Path MTU Discovery for IP Version 6. Request for Comments 8201. RFC Editor, 2017. doi:10.17487/RFC8201.
Medved, Jan, Nitin Bahadur, Hariharan Ananthakrishnan, Xufeng Liu, Robert Varga, and Alexander Clemm. “A Data Model for Network Topologies.” Internet-Draft. Internet Engineering Task Force, March 2017. https://tools.ietf.org/html/draft-ietf-i2rs-yang-network-topo-12.
Moy, John. OSPF Version 2. Request for Comments 2328. RFC Editor, April 1998. doi:10.17487/RFC2328.
Rekhter, Yakov, Susan Hares, and Tony Li. A Border Gateway Protocol 4 (BGP-4). Request for Comments 4271. RFC Editor, 2006. doi:10.17487/rfc4271.
Retana, Alvaro, Russ White, and Don Slice. EIGRP for IP: Basic Operation and Configuration. 1st edition. Boston, MA: Addison-Wesley Professional, 2000.
Savage, Donnie, Steven Moore, James Ng, Russ White, Donald Slice, and Peter Paluch. Cisco’s Enhanced Interior Gateway Routing Protocol (EIGRP). Request for Comments 7868. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7868.txt.
White, Russ, Alvaro Retana, and Don Slice. Optimal Routing Design. 1st edition. Cisco Press, 2005.
1. Classify each device as either a transit or a leaf node:
a. A mobile phone being used as a hot spot
b. A router
c. A database server
d. A switch
e. A proxy server
2. Explain the difference between aggregation and summarization as it is used in the chapter (and throughout this book).
3. Note the kind of neighbor discovery, two-way connectivity check, and link MTU discovery used in each of the following routing protocols:
a. Open Shortest Path First (OSPF)
b. Intermediate System to Intermediate System (IS-IS)
c. Routing Information Protocol (RIP)
d. Border Gateway Protocol (BGP)
4. Classify each of the following protocols as reactively or proactively discovering the topology and calculating the set of loop-free paths through the network:
a. Spanning Tree Protocol (STP)
b. Open Shortest Path First (OSPF)
c. BABEL
d. OpenFlow
5. Classify each of the following protocols as reactively or proactively discovering and advertising reachable destinations:
a. Spanning Tree Protocol (STP)
b. Open Shortest Path First (OSPF)
c. BABEL
d. OpenFlow
6. Describe a situation where an overflow in a cache used to hold forwarding information can cause the control plane to fail to forward packets correctly until the cache is either timed out or otherwise cleared.
7. Read the explanation of a gray failure (from the paper noted in the “Further Reading” section). How do you think gray failures might relate to the discovery of neighbor status and checking for two-way connectivity?
8. 假设您可以在重新分发路由时对其进行标记,然后在所有其他重新分发点根据这些标记进行过滤。您能解释一下如何使用这种标记来防止重新分配路由循环吗?
8. Assume you could tag routes as they are being redistributed, and then filter based on those tags at all other redistribution points. Can you explain how this kind of tagging could be used to prevent redistribution routing loops?
9. 似乎可以建立一个表,在重新分配过程中自动将度量标准从一种协议转换为另一种协议,但很少(几乎没有)路由协议设计有这种功能。这样的系统会出现什么问题呢?
9. It seems it would be possible to build a table that converts metrics from one protocol to another automatically during the redistribution process, and yet very few (almost no) routing protocols are designed with this kind of capability. What would be the problem with such a system?
10. 一种协议,增强型内部网关路由协议(EIGRP),确实允许路由进程直接从外部路由进程设置外部度量。您能找出可能出现这种情况的情况并解释原因吗?
10. One protocol, the Enhanced Interior Gateway Routing Protocol (EIGRP), does allow a routing process to set the external metrics directly from the external routing process. Can you figure out the circumstances when this is possible, and explain why?
Network engineers typically think of the control plane as doing a wide variety of things, from calculating the shortest path through the network to distributing policy used to forward packets. The idea of the shortest path, however, sneaks in the concept of the optimal path. Likewise, the idea of policy also sneaks in the concept of optimization of network resources. While both policy and the shortest path are important, neither one of these is at the root of what the control plane does. The job of the control plane is to find a set of loop-free paths through a network first; optimization is a nice add-on, but optimization can only be “done” in the context of finding a set of loop-free paths.
The question this chapter will answer, then, is
How does a control plane calculate loop-free paths through a network?
This chapter will begin by examining the relationship between the shortest, or lowest metric, path and loop-free paths. The next topic considered is Loop-Free Alternate (LFA) paths, which are not the best paths but still loop free. Such paths are useful in designing control planes that quickly switch from the best path to an alternate loop-free path in the case of failures or changes in the network topology. Two specific mechanisms used for finding a set of loop-free paths are then discussed; two more are discussed in Chapter 13, “Unicast Loop-Free Paths (2).”
The relationship between the shortest path, generally in terms of metrics, and loop-free paths is fairly simple: the shortest path is always loop free. The reason for this relationship can be expressed most simply in terms of geometry (or more specifically graph theory, which is a specialized field of study within discrete mathematics). Figure 12-1 is used to explain why.
What are the paths available from A, B, C, and D toward the destination?
• From A: [B,H]; [C,E,H]; [D,F,G,H]
• From B: [H]; [A,C,E,H]; [A,D,F,G,H]
• From D: [F,G,H]; [A,C,E,H]; [A,B,H]
If every device in the network must choose the path it will use toward the destination independently (without reference to the path chosen by any other device), it is possible to form persistent loops. For instance, A could choose the path [D,F,G,H], and D could choose the path [A,C,E,H]. Device A will then forward traffic toward the destination to D, and D will then forward traffic toward the destination to A. There must be some rule, other than “choose a path,” implemented by the algorithm used to calculate a path on each device, such as “choose the shortest (or lowest cost) path.” But why does choosing the shortest (or lowest cost) path prevent the loop? Figure 12-2 illustrates.
Figure 12-2 assumes A chooses the path [D,F,G,H] to the destination, and D chooses the path through A to the destination. What D cannot know, because it is calculating a path to the destination without any knowledge of what A has calculated, is that A is using the path through D itself to reach the destination. How can the control plane avoid such a loop? By observing that the cost of a path along a loop must always contain the cost of the loop as well as the loop-free element of the path. In this case, the path through A, from the perspective of D, must include the cost from D to the destination. Hence the cost through A, from the perspective of D, will always be greater than the lowest available cost from D. This leads to the following observation:
The lowest cost (or shortest) path cannot contain a path that passes through the calculating node; or rather, the shortest path is always loop free.
There are two important points about this observation.
First, this observation does not say paths with higher costs are definitely loops, only that the lowest cost path must not be a loop. It is possible to expand the rule to discover a wider set of loop-free paths beyond the lowest cost path; these are called Loop-Free Alternates.
Second, this observation holds only if every node in the network has the same view of the network topology. Nodes can have different views of the network topology for a number of reasons; for instance:
• The network topology has changed, and all the nodes have not yet been notified of the change; hence microloops.
• Some information about the network topology has been removed from the topology database through summarization or aggregation.
• The metrics have been configured so the lowest cost path is inconsistent from different perspectives.
Control planes used in real networks are carefully crafted to either work around or minimize the impact of different devices having different views of the network topology, potentially causing a looped path. For instance:
• Control planes are carefully tuned to minimize the time differential between learning of a topology change and modifying forwarding (or to drop traffic during topology changes, rather than forwarding it).
• When summarizing topology or aggregating reachability, care is taken to preserve cost information.
• Network design “best common practices” encourage the use of symmetric metrics, and many implementations make it difficult or impossible to configure links with truly dangerous metrics, such as a zero link cost.
It often takes a great deal of design work to find, and work around or prevent, the unintended subversion of the shortest path rule in real-world control plane protocols.
在图12-3中,许多不同的路径将触及每个节点;例如,从A的角度来看:
In Figure 12-3, a number of different paths will touch every node; for instance, from A’s perspective:
1. [A,B,E,D,C] 和 [A,C,D,E,B],每项总成本为 10
1. [A,B,E,D,C] and [A,C,D,E,B], each with a total cost of 10
2. [A,B,E] 成本为 5,[A,C,D] 成本为 3,总成本为 8
2. [A,B,E] with a cost of 5 and [A,C,D] with a cost of 3, for a total cost of 8
3. [A,C,D,E] 成本为 6,[A,B] 成本为 1,总成本为 7
3. [A,C,D,E] with a cost of 6 and [A,B] with a cost of 1, for a total cost of 7
MST是一棵树,它以最小的总成本(通常以网络中选择的所有链路的总和来衡量)访问网络中的每个节点。计算 MST 的算法将选择选项 3,因为它沿着到达网络中每个节点所需的边集具有最低的总成本。
An MST is a tree that visits each node in the network with the minimum overall cost (normally measured as the sum of all the links chosen in the network). An algorithm that computes the MST will choose option 3, as it has the lowest total cost along the set of edges required to reach every node in the network.
SPT描述了到网络中每个目的地的最短路径,与图的总成本无关。从 A 的角度来看,计算 SPT 的算法会选择:
An SPT describes the shortest path to each destination in the network, independent of the total cost of the graph. An algorithm that calculates an SPT would choose, from A’s perspective:
• [A,B] 到 B,成本为 1,因为该路径比 [A,C,D,E,B] 短,成本为 10
• [A,B] to B with a cost of 1, as this path is shorter than [A,C,D,E,B] with a cost of 10
• [A,B,E] 到 E,成本为 5,因为这比成本为 6 的 [A,C,D,E] 短
• [A,B,E] to E with a cost of 5, as this is shorter than [A,C,D,E] with a cost of 6
• [A,C] 到 C 的成本为 1,因为这比成本为 10 的 [A,B,E,D,C] 短
• [A,C] to C with a cost of 1, as this is shorter than [A,B,E,D,C] with a cost of 10
• [A,C,D] 到 D 的成本为 3,因为这比成本为 8 的 [A,B,E,D] 短
• [A,C,D] to D with a cost of 3, as this is shorter than [A,B,E,D] with a cost of 8
将最短路径集与上面将触及每个节点的路径集进行比较,计算 SPT 的算法将选择选项 2,而不是前面列表中的选项 3。换句话说,SPT将忽略MST中边的总成本来寻找到每个可达目的地(在本例中为节点)的最短路径,而MST将忽略到每个可达目的地的最短路径以最小化整个图的成本。
Comparing the set of shortest paths to the set of paths that will touch every node, above, an algorithm that calculates an SPT would choose option 2, rather than 3 in the preceding list. In other words, the SPT will ignore the total cost of the edges in the MST to find the shortest path to each reachable destination (in this case, nodes), while the MST will ignore the shortest path to each reachable destination in order to minimize the cost of the entire graph.
网络控制平面最常使用某种形式的贪婪算法来计算 SPT,而不是 MST 。虽然 SPT 并不是解决所有网络流量问题的最佳选择,但在网络控制平面必须解决的流量问题类型上,它们通常比 MST 更好。
Network control planes most often compute SPTs, rather than MSTs, using some form of greedy algorithm. While SPTs are not optimal for solving all network traffic flow problems, they are generally better than MSTs in the types of traffic flow problems that network control planes must solve.
The shortest path rule, as described in the preceding section, is a negative test, rather than a positive one; it can always be used to find a loop-free path among a set of available paths, but not to determine which other paths in the set might also happen to be loop free. Figure 12-4 illustrates.
In Figure 12-4, it is easy to observe that the shortest path from A to the destination is along the path [A,B,F]. It is also easy to observe that the paths [A,C,F] and [A,D,E,F] are alternate paths to the same destination. But are these paths loop free? The answer depends on the meaning of loop free: normally a loop-free path is one in which the traffic will not loop through any node (will not visit any node in the topology more than once). While this definition is generally good, it is possible to narrow the definition in the case of a single node with multiple next hops over which it can send traffic toward a reachable destination. Specifically, the definition can be narrowed to:
A path is loop free if the next hop device will not forward traffic toward a specific destination back to me (the sending node).
In this case, the path through C, from A’s perspective, can be said to be loop free if C does not forward traffic toward the destination through A. In other words, if A transmits a packet to C for Destination, C will not forward the packet back to A, but rather will forward the packet closer to Destination. This definition simplifies the problem of finding alternate loop-free paths somewhat. Rather than considering the entire path toward the destination, A needs to only consider whether or not any particular neighbor will forward traffic back to A itself when forwarding traffic towards the destination.
Consider, for instance, the path [A,C,F]. If A sends a packet to C for the destination beyond F, will C forward this packet back to A? The paths available to C are
• [C,A,B,F], with a total cost of 5
• [C,A,D,E], with a total cost of 6
• [C,F], with a total cost of 2
Given C is going to choose the shortest path to the destination, it will choose [C,F], and hence will not forward the traffic back to A. Turning this into a question: why will C not forward traffic back to A? Because it has a path that is lower cost than any path through A to reach the destination. This can be generalized and called a downstream neighbor:
Any neighbor with a path that is shorter than the local path to the destination will not loop traffic back to me (the sending node).
Or rather, given that the local cost is represented as LC, and the neighbor’s cost is represented as NC, then
If NC < LC, then the neighbor is downstream.
Now consider the second alternate path shown in Figure 12-4: [A,D,E,F]. Once again, if A sends traffic toward the destination to D, will D loop the traffic back to A? The paths D has available are
• [D,A,C,F], with a total cost of 5
• [D,A,B,F], with a total cost of 4
• [D,E,F], with a total cost of 3
Assuming D will use the shortest available path, D would forward any such traffic through E, rather than back through A. This can be generalized and called a Loop-Free Alternate (LFA):
Any neighbor with a path that is shorter than the local path to the destination plus the cost of the neighbor to reach me (the local node) will not loop traffic back to me (the local node).
Or rather, given the local cost is represented as LC, the neighbor’s cost is represented as NC, and the cost back to the local node (from the neighbor’s perspective) is BC:
If NC < LC + BC, then the neighbor is an LFA.
There are two other models often used to explain Loop-Free Alternates: the waterfall model and P/Q Space. It is useful to look at these models in a little more detail.
One way to prevent loops in the routes calculated by a control plane is to simply not advertise routes to neighbors that would forward traffic back to me (the sending node). This is called split horizon; it leads to the concept of traffic flowing through a network acting like water along a waterfall, or stream bed, taking the path of least resistance toward the destination, as shown in Figure 12-5.
In Figure 12-5, if traffic enters the network at C (at Source 2) and is destined beyond E, it will flow down the right side of the ring. If, however, traffic enters the network at A and is destined beyond E, it will flow down the left side of the ring. To prevent traffic destined beyond E from looping on this ring, one simple thing the control plane can do is either not allow A to advertise the destination to C, or not allow C to advertise the destination to A. Preventing one of these two routers from advertising to the other is called split horizon, because it stops a route from being propagated across a horizon, or rather beyond the point where any particular device knows traffic being passed along a particular link will be looped.
Split horizon is implemented by only allowing a device to advertise reachability through interfaces it is not using to reach the destination in question. In this case:
• D is using E to reach the destination, so it will not advertise reachability toward E
• C is using D to reach the destination, so it will not advertise reachability toward D
• B is using E to reach the destination, so it will not advertise reachability toward E
• A is using B to reach the destination, so it will not advertise reachability toward B
Hence, A blocks B from knowing about the alternate path that it has to the destination through C, and C blocks D from knowing about the alternate path that it has to the destination through A. A Loop-Free Alternate path will cross this split horizon point in the network. In Figure 12-5, A can calculate that C’s path cost is less than A’s path cost, so any traffic A forwards to C toward the destination will be forwarded along some other path than the one A knows about. C, in LFA terms, is a downstream neighbor of A.
An alternate way to look at the LFA calculation, then, is to find the split horizon point in the ring and determine whether or not the devices on either side of the split horizon point would forward traffic through the packet divide.
Another model to describe how LFAs work is P/Q Space; Figure 12-6 illustrates.
It is easiest to begin with a definition of the two spaces. Assuming the [E,D] link is to be protected from failure:
• Calculate a reverse Shortest Path Tree from E (E uses the cost of the paths toward itself, rather than the costs away from itself, in calculating this tree, because traffic is flowing toward D on this path).
• Remove the [E,D] link, along with any nodes only reachable by passing through the link.
• The remaining nodes that E can reach are the Q space.
• Calculate a Shortest Path Tree from D.
• Remove the [E,D] link, along with any nodes only reachable by passing through the link.
• The remaining nodes that D can reach are in the P space.
If D can find a router in the Q space to which to forward traffic if the [E,D] link fails, this is an LFA.
What if there is no LFA? It is sometimes possible to find a remote Loop-Free Alternate (rLFA), which can carry the traffic to the destination, as well. The rLFA is not directly connected to the calculating router, but is rather one or more hops away; this means the traffic must be carried through the routers between the calculating router and the remote next hop; this is normally accomplished by tunneling the traffic.
These models can explain rLFAs without looking at the math required to calculate them. Understanding where a ring will “divide” into P and Q, or into the two halves divided by split horizon helps you quickly understand where an rLFA can be used to work around a failure even if no LFA is present. Returning to Figure 12-6, for instance, if the [E,D] link fails, D must simply wait for the network to converge to begin forwarding traffic toward the destination. The best path from E has been removed from D’s tree by the failure, and E has no LFA it can forward traffic to.
Return to the restricted definition of a loop-free path that this section began with—any neighbor to which a device can forward traffic without the traffic being returned. There is no particular reason why the neighbor to which a device sends packets in the case of a local link failure must be locally connected. Chapter 9, “Network Virtualization,” describes the ability to create a tunnel, or an overlay topology, that can carry traffic between any two nodes in the network.
Given the ability to tunnel traffic across C, so C does not forward traffic based on the actual destination, but rather on a tunnel header, D can forward traffic directly to A, bypassing the loop. When the [E,D] link fails, then, D can do the following:
1. Calculate the closest point in the network where traffic can be tunneled and will not return to C itself.
2. Form a tunnel to that router.
3. Encapsulate the traffic into the tunnel header.
4. Forward the traffic.
Note
In actual implementations, the rLFA tunnel would be precalculated, rather than calculated at the time of failure. These rLFA tunnels do not necessarily need to be visible to the normal forwarding process, as well. This text is arranged for clarity of how this process works, rather than focusing on how it is normally implemented.
D will forward the traffic to the tunnel destination, rather than the original destination; this bypasses C’s local forwarding table entry for the original destination, which would loop the traffic back to C. The calculation of such intersection points will be discussed in the section on Dijkstra’s Shortest Path First algorithm in Chapter 13.
Bellman-Ford is one of the simpler protocols to understand, as it is generally implemented by comparing newly learned information about a destination with existing information about the same destination. If the newly discovered route is better than the currently known route, the higher cost route is simply replaced in the path list—as dictated by the shortest path rule for finding loop-free paths through the network. By iterating over the entire topology in this way, a set of shortest paths to each destination is found. Figure 12-7 is used to illustrate the process.
While Bellman-Ford is mostly known for its distributed variant implemented in widely deployed protocols such as the Routing Information Protocol (RIP), it was originally designed as a search algorithm performed on a single structure describing a topology of nodes and edges. Bellman-Ford is discussed as an algorithm here. A distributed algorithm similar to Bellman-Ford is discussed in the next section.
The actual runtime of any algorithm used for calculating a Shortest Path Tree is normally swamped by the amount of time required to carry information about topology changes through the network; see Chapter 14, “Reacting to Topology Changes,” for more information on this topic. Implementations of all of these protocols, particularly in their distributed form, will contain a number of optimizations to reduce their runtime to far below the worst case, so while the worst case is given as a reference point, it often has little (or no) bearing on the performance of each algorithm in actual deployed networks.
To run Bellman-Ford over this topology, it must first be converted into a set of vectors and distances, and stored in a data structure, such as shown in Table 12-1.
Table 12-1 Topology, or Edges, Represented as a Table for Bellman-Ford
Row | Source (s) | Destination (d) | Distance (cost)
1 | F (6) | G (7) | 1
2 | E (5) | H (8) | 1
3 | D (4) | H (8) | 2
4 | D (4) | E (5) | 1
5 | B (2) | F (6) | 1
6 | B (2) | E (5) | 2
7 | C (3) | D (4) | 1
8 | A (1) | B (2) | 2
9 | A (1) | C (3) | 1
There are nine entries in this table because there are nine links (edges) in the network. Shortest path algorithms calculate a unidirectional tree (in one direction along the graph). In the network in Figure 12-7, the SPT is shown originating at node 1, and calculation is shown moving away from node 1, which will be the point from which the calculation takes place. The algorithm, in pseudocode, is as follows:
Note
The data structures in this example are 1 referenced (or based), which means the first row is 1 rather than 0, to make the numbering clearer.
// create a set to hold the response, with one entry for each node
// the first slot in the resulting structure will represent node 1,
// the second node 2, etc.
define route[nodes] {
    predecessor // as a node
    cost        // as an integer
}

// set the source (me) to 0 cost
// position 1 in the array is the origination point’s entry
route[1].predecessor = NULL
route[1].cost = 0

// Table 12-1, above, is held in an array called topo; edges is the
// number of rows (links) in that table, nine here against eight nodes
// walk the topo (edges) table once for each entry in the route
// (results) table, replacing longer entries with shorter ones
i = nodes
while i > 0 {
    j = 1
    while j <= edges { // iterates over every row in the topology table
        source_router = topo[j].s
        destination_router = topo[j].d
        link_cost = topo[j].cost
        if route[source_router].cost == NULL {
            source_router_cost = INFINITY
        } else {
            source_router_cost = route[source_router].cost
        }
        if route[destination_router].cost == NULL {
            destination_router_cost = INFINITY
        } else {
            destination_router_cost = route[destination_router].cost
        }
        if source_router_cost + link_cost <= destination_router_cost {
            route[destination_router].cost = source_router_cost + link_cost
            route[destination_router].predecessor = source_router
        }
        j = j + 1
    }
    i = i - 1
}
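The pseudocode above can be expressed as a short runnable Python sketch over the nine edges of Table 12-1 (one assumption: strict < is used in place of <=, so equal-cost ties do not churn predecessors; the resulting costs are the same):

```python
# A runnable version of the pseudocode above, using the nine directed
# edges from Table 12-1 (node numbers 1-8 correspond to A-H).
INF = float("inf")

# (source, destination, cost) rows, exactly as in Table 12-1
topo = [
    (6, 7, 1),  # F -> G
    (5, 8, 1),  # E -> H
    (4, 8, 2),  # D -> H
    (4, 5, 1),  # D -> E
    (2, 6, 1),  # B -> F
    (2, 5, 2),  # B -> E
    (3, 4, 1),  # C -> D
    (1, 2, 2),  # A -> B
    (1, 3, 1),  # A -> C
]
nodes = 8

# route[n] holds [cost, predecessor]; node 1 (A) is the source
route = {n: [INF, None] for n in range(1, nodes + 1)}
route[1] = [0, None]

# Walk the edge table once per node, replacing longer paths with
# shorter ones as they are discovered.
for _ in range(nodes):
    for s, d, cost in topo:
        if route[s][0] + cost < route[d][0]:
            route[d] = [route[s][0] + cost, s]

costs = {n: route[n][0] for n in route}
# Matches the walkthrough that follows: B=2, C=1, D=2, E=3 (via D),
# F=3, G=4, and H=4 (reachable at equal cost through D or E).
assert costs == {1: 0, 2: 2, 3: 1, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4}
```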
This code is deceptive in appearing more complex than it really is. The key line is the comparison if source_router_cost + link_cost <= destination_router_cost; it is useful to focus on this line through an example. In the first run through the outer loop (which is run once for each entry in the results table, called route here):
• 对于拓扑表的第一行:
• For the first line of the topo table:
• j为 1,因此topo[j].s为节点 6 (F),即边表中向量的源
• j is 1 so topo[j].s is node 6 (F), the source of the vector in the edge table
• j为 1,因此 topo [j].d为节点 7 (G),即边表中向量的目的地
• j is 1, so topo[j].d is node 7 (G), the destination of the vector in the edge table
• 路线[6].cost =无穷大,topo[1].cost = 1,并且路线[7].cost = 无穷大
• route[6].cost = infinity, topo[1].cost = 1, and route[7].cost = infinity
•无穷大 + 1 == 无穷大,因此条件失败并且不会发生其他情况
• infinity + 1 == infinity, so the condition fails and nothing else happens
• 源成本为无穷大的任何拓扑表条目都将给出相同的结果,因为无穷大+任何值将始终等于无穷大;包含成本为无穷大的源的其余行将被跳过。
• Any topo table entry with a source cost of infinity will give the same result, as infinity + anything will always equal infinity; the rest of the rows containing a source with a cost of infinity will be skipped.
• 对于拓扑表的第八行(第八条边):
• For the eighth line of the topo table (the eighth edge):
• j为 8,因此topo[j].s为节点 1 (A),即边表中向量的源
• j is 8, so topo[j].s is node 1 (A), the source of the vector in the edge table
• j为 8,因此 topo [j].d为节点 2 (B),即边表中向量的目的地
• j is 8, so topo[j].d is node 2 (B), the destination of the vector in the edge table
• 路线[1].cost = 0,topo[8].cost = 2,路线[2].cost = 无穷大
• route[1].cost = 0, topo[8].cost = 2, and route[2].cost = infinity
• 0 + 2 <= 无穷大,因此条件成功
• 0 + 2 <= infinity, so the condition succeeds
• route[2].predecessor 设置为 1,route[2].cost 设置为 2
• route[2].predecessor is set to 1, and route[2].cost is set to 2
• For the ninth line of the topo table (the ninth edge):
• j is 9, so topo[j].s is node 1 (A), the source of the vector in the edge table
• j is 9, so topo[j].d is node 3 (C), the destination of the vector in the edge table
• route[1].cost = 0, topo[9].cost = 1, and route[3].cost = infinity
• 0 + 1 < infinity, so the condition succeeds
• route[3].predecessor is set to 1, and route[3].cost is set to 1
In the second run of the outer loop:
• For the fifth line of the topo table (the fifth edge):
• j is 5, so topo[j].s is node 2 (B), the source of the vector in the edge table
• j is 5, so topo[j].d is node 6 (F), the destination of the vector in the edge table
• route[2].cost = 2, topo[5].cost = 1, and route[6].cost = infinity
• 2 + 1 < infinity, so the condition succeeds
• route[6].predecessor is set to 2, and route[6].cost is set to 3
• For the sixth line of the topo table (the sixth edge):
• j is 6, so topo[j].s is node 2 (B), the source of the vector in the edge table
• j is 6, so topo[j].d is node 5 (E), the destination of the vector in the edge table
• route[2].cost = 2, topo[6].cost = 2, and route[5].cost = infinity
• 2 + 2 < infinity, so the condition succeeds
• route[5].predecessor is set to 2, and route[5].cost is set to 4
• The remainder of this run is shown in Table 12-2.
In the third run of the outer loop, node 8 is of particular interest, as there are two paths to this destination.
• For the second line of the topo table (the second edge):
• j is 2, so topo[j].s is node 5 (E), the source of the vector in the edge table
• j is 2, so topo[j].d is node 8 (H), the destination of the vector in the edge table
• route[5].cost = 4, topo[2].cost = 1, and route[8].cost = infinity
• 4 + 1 < infinity, so the condition succeeds
• route[8].predecessor is set to 5, and route[8].cost is set to 5
• For the third line of the topo table (the third edge):
• j is 3, so topo[j].s is node 4 (D), the source of the vector in the edge table
• j is 3, so topo[j].d is node 8 (H), the destination of the vector in the edge table
• route[4].cost = 2, topo[3].cost = 2, and route[8].cost = 5
• 2 + 2 < 5, so the condition succeeds
• route[8].predecessor is set to 4, and route[8].cost is set to 4
The interesting point in the third cycle through the topo table is that the entry for the edge [5,8] is processed first, which sets 8’s (H’s) predecessor to 5 and its cost to 5. When the next line in the topo table, the [4,8] edge, is processed, however, the algorithm discovers a shorter path to node 8 and replaces the existing one. Table 12-2 shows the state of the route table with each pass through the topo table.
Table 12-2 Bellman-Ford Cycles Across the Sample Network
|              | A (1) | B (2) | C (3) | D (4) | E (5) | F (6) | G (7) | H (8) |
|              | P | C | P | C | P | C | P | C | P | C | P | C | P | C | P | C |
| First Cycle  | N | 0 | 1 | 2 | 1 | 1 | N | I | N | I | N | I | N | I | N | I |
| Second Cycle | N | 0 | 1 | 2 | 1 | 1 | 3 | 2 | 2 | 4 | 2 | 3 | N | I | N | I |
| Third Cycle  | N | 0 | 1 | 2 | 1 | 1 | 3 | 2 | 2 | 4 | 2 | 3 | 6 | 4 | 4 | 4 |
(P = predecessor; C = cost; N = none; I = infinity)
In Table 12-2, the top line represents an entry in the routing table and a node that is reachable in the network. For instance, A (1) represents the best path to A, B (2) represents the best path to B, etc. The P column represents the predecessor, or the node through which A must pass to reach the destination indicated. The C represents the cost to reach this destination. The sample network can be completed in three cycles, given the algorithm is coded to detect the completion of the tree. The pseudocode, as shown, does not have any test for this completion and would run the full 8 cycles (one for each node) anyway.
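The walkthrough and Table 12-2 can be reproduced with a short, runnable sketch of the pseudocode. The edge list is reconstructed from the walkthrough: rows 1, 2, 3, 5, 6, 8, and 9 are fixed by the text, but the [3,4] (C to D) entry and the exact ordering are assumptions, and the book's full topo table may contain additional rows not recoverable from this excerpt.

```python
INF = float("inf")

# topo: (source, destination, cost) vectors, with nodes numbered 1-8 for A-H.
# The [3,4] (C -> D) row is an assumption; the rest follow the walkthrough.
topo = [
    (6, 7, 1),  # F -> G
    (5, 8, 1),  # E -> H
    (4, 8, 2),  # D -> H
    (3, 4, 1),  # C -> D (assumed)
    (2, 6, 1),  # B -> F
    (2, 5, 2),  # B -> E
    (1, 2, 2),  # A -> B
    (1, 3, 1),  # A -> C
]

def bellman_ford(topo, source, num_nodes):
    route = {n: {"cost": INF, "predecessor": None}
             for n in range(1, num_nodes + 1)}
    route[source]["cost"] = 0
    for _ in range(num_nodes):  # the outer loop: one pass per route entry
        for s, d, cost in topo:
            # The key comparison discussed above; infinity plus anything
            # stays infinity, so edges from unreached sources are skipped.
            if route[s]["cost"] + cost < route[d]["cost"]:
                route[d]["cost"] = route[s]["cost"] + cost
                route[d]["predecessor"] = s
    return route

route = bellman_ford(topo, source=1, num_nodes=8)
```

Running this yields the third-cycle state of Table 12-2: for example, H (node 8) ends with predecessor 4 (D) and cost 4, because the [4,8] edge replaces the longer path first learned through [5,8].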
Note
Bellman-Ford can also support negative cost edges (unlike Dijkstra’s algorithm); as these do not normally exist in a network, the process for handling these is not shown here.
The Diffusing Update Algorithm (DUAL) is one of the two algorithms discussed here originally designed to be implemented in a distributed network. It is unique in that the removal of reachability and topology information is contained in the algorithm’s state machine. The other algorithms discussed here leave the removal of information to the implementation of the protocol, rather than considering this aspect of the algorithm’s operation within the algorithm itself.
As DUAL is designed as a distributed algorithm, it is best to describe its operation across a network; Figure 12-8 and Figure 12-9 are used for this purpose. To explain DUAL, this example will trace the flow of A learning about three destinations and then processing changes in the state of reachability for these same destinations. The first example will consider the case where there is an alternate path, but no downstream neighbor; the second will consider the case where there is an alternate path and a downstream neighbor.
Note
While the original DUAL paper refers to neighbor adjacencies, they will not be described in this discussion. Rather, it will simply be assumed such neighbors exist, and hence the transmission of control plane data is reliable.
In Figure 12-8, learning D from A’s perspective:
1. A learns two paths to D:
a. Through H with a cost of 3.
b. Through C with a cost of 4.
2. A will not learn the path through B, because B is using A as its successor:
a. A is the best path B has to reach D.
b. As B is using the path through A to reach D (the destination), it will not advertise the route it knows about D (through C) to A.
c. B will split horizon its advertisement of D toward A to prevent possible forwarding loops from forming.
3. A compares the available paths and chooses the shortest path as loop free:
a. The path through H is marked as the successor.
b. The feasible distance is set to the cost along the shortest path, which is 3.
4. A checks the remaining paths to determine if any of them are downstream neighbors:
a. C’s cost is 3.
A knows this because C advertises the route to D with its local metric, which is 3. A saves C’s local metric in its topology table.
Hence, A knows the local cost at C and the local cost at A.
b. 3 (the cost at C) >= 3 (the cost at A), so this route may be a loop; hence, C does not meet the feasibility condition.
c. C is not marked as a downstream neighbor.
Downstream neighbors are called feasible successors in DUAL.
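The successor and feasible-successor selection in steps 3 and 4 can be sketched as a pair of comparisons. The structure here is illustrative rather than EIGRP's actual tables; each candidate carries the neighbor's advertised (local) cost and the total cost as seen from A, and H's advertised cost of 2 is assumed from a 1-cost [A,H] link.

```python
def select_paths(candidates):
    """candidates: (neighbor, advertised_cost, total_cost) tuples.

    The successor is the lowest total cost path; the feasible distance
    is its cost. A neighbor is a feasible successor (downstream
    neighbor) only if its advertised cost is strictly less than the
    feasible distance, so its path cannot loop back through us.
    """
    successor = min(candidates, key=lambda c: c[2])
    feasible_distance = successor[2]
    feasible_successors = [c for c in candidates
                           if c is not successor
                           and c[1] < feasible_distance]
    return successor, feasible_distance, feasible_successors

# Figure 12-8: H advertises D at cost 2 (total 3 from A); C advertises
# D at cost 3 (total 4 from A).
successor, fd, feasible = select_paths([("H", 2, 3), ("C", 3, 4)])
# C's advertised cost (3) is not < the feasible distance (3), so
# feasible is empty: C is not marked as a downstream neighbor.
```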
Assume the [A,H] link fails. DUAL does not rely on periodic updates, so A cannot simply wait for another update with valid information; rather, A must actively pursue an alternate path. This is, therefore, a diffused process of alternate path discovery. If the [A,H] link fails, considering just D:
1. A examines its local table for any feasible successors (downstream neighbors).
2. There are no feasible successors, so A must discover an alternate loop-free path to D (if one exists).
3. A sends a query to each neighbor to determine if there is some alternate loop-free path to D.
4. At C:
a. C’s successor is E (not A, from whom it received the query).
b. E’s cost is lower than A’s cost to D; hence C’s path is not a loop.
c. C replies with its current metric of 3 to A.
5. At B:
a. A is B’s current successor.
b. Through the query, B now discovers its best path to D has failed, and it must also find an alternate path.
c. B’s processing is not considered here, but rather is left as an exercise for the reader.
d. B replies to A that it has no alternate path (responds with an infinite metric).
6. A receives these replies:
a. The path through C is the only one available, with a cost of 4.
b. A marks the path through C as its successor.
c. There are no other paths to D; hence there is no feasible successor (downstream neighbor).
In Figure 12-9, the destination (D) has been moved from H to E; this will be used for the second example.
In this example, there is a feasible successor (downstream neighbor). Learning D from A’s perspective:
1. A learns two paths to D:
a. Through H with a cost of 4.
b. Through C with a cost of 3.
2. A will not learn any path through B:
a. B has two paths to D.
b. Through both C and A with a cost of 4.
c. B is using both A and C as its successors in this case.
d. B will split horizon its advertisement of D toward A because A is marked as a successor.
3. A compares the available paths and chooses the shortest path as loop free:
a. The path through C is marked as the successor.
b. The feasible distance is set to the cost along the shortest path, which is 3.
4. A checks the remaining paths to determine if any of them are downstream neighbors:
a. H’s cost is 2.
b. 2 (the cost at H) < 3 (the cost at A), so this route cannot be a loop; hence H does meet the feasibility condition.
c. H is marked as a feasible successor (downstream neighbor).
If the [A,C] link fails, considering just A:
1. A will examine its local topology table for a feasible successor.
2. A feasible successor exists through H.
3. A switches its local table to H as the best path.
a. No diffusing update has been run, so no paths have been verified or recalculated.
b. Hence, the feasible distance cannot be changed; it remains at 3.
4. A sends an update to its neighbors noting its cost to reach D has changed from 3 to 4.
The impact of this update is not described here, but consider that B is using A as a successor.
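The two failure cases (diffusing queries with no feasible successor, a local switch with one) come down to a single decision against the topology table. This is a heavily simplified sketch of that decision, not DUAL's full state machine, and the tuple layout is illustrative:

```python
def on_successor_loss(remaining, feasible_distance):
    """remaining: (neighbor, advertised_cost, total_cost) entries left
    for the destination after the successor is lost.

    If some entry satisfies the feasibility condition, the route stays
    passive and the best such entry becomes the new successor; the
    feasible distance is not recalculated. Otherwise the route goes
    active, and a query must be sent to every neighbor.
    """
    feasible = [c for c in remaining if c[1] < feasible_distance]
    if feasible:
        return ("switch", min(feasible, key=lambda c: c[2]))
    return ("query", None)

# Figure 12-9, [A,C] fails: H (advertised cost 2, total 4) is feasible,
# so A switches locally without querying.
action, new_successor = on_successor_loss([("H", 2, 4)], feasible_distance=3)

# Figure 12-8, [A,H] fails: C (advertised cost 3) fails the feasibility
# condition, so A must query its neighbors.
action2, _ = on_successor_loss([("C", 3, 4)], feasible_distance=3)
```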
As you can see, processing when a feasible successor exists is much faster and simpler than without. In networks where a routing protocol using DUAL (specifically EIGRP) has been deployed, one primary design goal will be limiting the scope of any queries generated in the case where there is no feasible successor. Query scope is the primary determinant of how quickly the DUAL algorithm completes and hence how quickly the network converges.
Figure 12-10 illustrates a basic DUAL finite state machine.
Things included in “route gets worse” could include
• Failure of a connected link or neighbor
• Receiving an update for a route with a higher metric
• Receiving a query from the current successor
Things included in “route gets better” could include
• A new route learned from a neighbor
• A new neighbor discovered, along with the routes this neighbor can reach
• Receiving replies to all queries sent to neighbors when a route gets worse
This chapter is the first of two discussing the calculation of loop-free paths through a network. The shortest path rule is the foundation of most calculation mechanisms, including Bellman-Ford and DUAL, the foundations of the most widely deployed distance-vector protocols (classifications of protocols are considered in more depth in Chapters 15 through 17, which discuss distributed and centralized control planes). The next chapter considers one more algorithm that relies on the shortest path rule, and then turns to path vector, and finally disjoint paths.
Bellman, Richard. “On a Routing Problem.” Quarterly of Applied Mathematics 16 (1958): 87–90.
“Enhanced Interior Gateway Routing Protocol (EIGRP) Wide Metrics White Paper.” Cisco. Accessed January 28, 2017. http://www.cisco.com/c/en/us/products/collateral/ios-nx-os-software/enhanced-interior-gateway-routing-protocol-eigrp/whitepaper_C11-720525.html.
Ford, L. R. Network Flow Theory. Santa Monica, CA: RAND Corporation, 1956.
Garcia-Luna-Aceves, J. J. “Loop-Free Routing Using Diffusing Computations.” IEEE/ACM Transactions on Networking 1, no. 1 (February 1993): 130–41.
Hendrick, C. Routing Information Protocol. Request for Comments 1058. RFC Editor, 1988. doi:10.17487/rfc1058.
Malkin, Gary S. RIP Version 2. Request for Comments 2453. RFC Editor, 1998. doi:10.17487/rfc2453.
Malkin, Gary S., and Robert E. Minnear. RIPng for IPv6. Request for Comments 2080. RFC Editor, 1997. doi:10.17487/rfc2080.
Moore, Edward F. “The Shortest Path through a Maze.” In Proceedings of the International Symposium on Switching Theory 1957, Part II. Cambridge, MA: Harvard University Press, 1959.
Perlman, Radia. “An Algorithm for Distributed Computation of a Spanning Tree in an Extended LAN.” SIGCOMM Computer Communication Review 15, no. 4 (September 1985): 44–53. doi:10.1145/318951.319004.
———. Interconnections: Bridges, Routers, Switches, and Internetworking Protocols. 2nd edition. Reading, MA: Addison-Wesley Professional, 1999.
Retana, Alvaro, Russ White, and Don Slice. EIGRP for IP: Basic Operation and Configuration. 1st edition. Boston, MA: Addison-Wesley Professional, 2000.
White, Russ. “CAP Theorem and Routing.” Rule 11 Reader, March 25, 2016. https://rule11.tech/cap-theorem-routing/.
———. “Ordered FIB.” Packet Pushers, March 25, 2014. http://packetpushers.net/ordered-fib/.
———. “Video: Do Remote LFAs Really Solve Microloops?” Rule 11 Reader, September 11, 2017. https://rule11.tech/video-remote-lfas-really-solve-microloops/.
Savage, Donnie, Steven Moore, James Ng, Russ White, Donald Slice, and Peter Paluch. Cisco’s Enhanced Interior Gateway Routing Protocol (EIGRP). Request for Comments 7868. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7868.txt.
Shimbel, A. “Structure in Communication Nets.” In Proceedings of the Symposium on Information Networks. New York: Polytechnic Press of the Polytechnic Institute of Brooklyn, n.d., 199–203.
1. Explain the relationship between the calculation of shortest paths and loop-free paths through a network.
2. What are the conditions an alternate path must meet to be considered a Loop-Free Alternate?
3. Explain the difference between the waterfall and P/Q space models of understanding where loops will form, using a network diagram containing seven routers in a ring and a single destination reachable through one of these routers.
4. When is an algorithm for solving the problem of loop-free paths called “greedy”?
5. Compare the state machine given in the chapter for DUAL to the state machine given in the EIGRP RFC. What is left out, what is combined, etc.? What are the advantages and disadvantages of having more or less detailed state machine diagrams? When would you prefer one or the other?
6. Draw a small network of around 10 or 11 nodes, and walk through the process of running the Bellman-Ford and Diffusing Update algorithms on it. Will DUAL find any Loop-Free Alternates in this network? Are there any places where a remote Loop-Free Alternate can be calculated?
7. In the network from question 6, assume a single link has failed; trace the reaction of DUAL to this event. Will queries be required? Why or why not?
1. Shimbel, “Structure in Communication Nets.”
2. Moore, “The Shortest Path through a Maze.”
3. Bellman, “On a Routing Problem,” 87–90.
4. Ford, Network Flow Theory.
5. Garcia-Luna-Aceves, “Loop-Free Routing Using Diffusing Computations,” 130–41.
The preceding chapter discussed the shortest path rule and two algorithms (or perhaps systems) to find loop-free paths through a network. There is a wide range of such systems—far too many to cover in a few chapters of a larger book—but it is important for network engineers to be familiar with at least a few of these systems. This chapter considers Dijkstra’s Shortest Path First, Path Vector, and two different disjoint path algorithms: Suurballe’s and Maximally Redundant Trees (MRTs). Finally, this chapter will consider one other problem that control planes need to solve: ensuring two-way connectivity through the network.
Dijkstra’s Shortest Path First (SPF) algorithm is, perhaps, the most widely recognized and understood system for discovering loop-free paths through a network. It is used by two widely deployed routing protocols, and in many other everyday systems such as software designed to find the shortest path through a road network, or to discover connections and connection patterns in social networks.
Dijkstra’s algorithm, in pseudocode, uses two data structures. The first is the tentative list, or the TENT; this list contains the set of nodes under consideration for inclusion in the Shortest Path Tree. The second is the PATH; this list contains the set of nodes (and therefore links, as well), which are on the Shortest Path Tree.
01 move "me" to the TENT
02 while TENT is not empty {
03 sort TENT
04 selected == first node on TENT
05 if selected is in PATH {
06 *do nothing*
07 }
08 else {
09 add selected to PATH
10 for each node connected to selected in TOPO
11 v = find node in TENT
12 if (!v)
13 move node to TENT
14 else if node.cost < v.cost
15 replace v with node on TENT
16 else
17 remove node from TOPO
18 }
19 }
As always, the algorithm is less complex than it appears on initial inspection; the key is the sorting of the two lists and the order in which nodes are processed off the TENT list. Here are some notes on the pseudocode before walking through an example:
1. The process starts with a copy of the topology database, called TOPO here; this will be clearer in the example, but it is simply a structure containing the source nodes, the destination nodes, and the cost of the link between them.
2. The TENT is the list of nodes that may, tentatively, be considered the shortest path to any particular node.
3. The PATH is the Shortest Path Tree (SPT), a structure containing a loop-free path to each node, and the next hop from “me” to that node.
4. The first crucial point in this algorithm is keeping on the TENT only nodes already connected in some way to a node on the PATH list; this means the shortest path on the TENT is the next shortest path in the network.
5. The second crucial point in this algorithm is the comparison between any existing nodes on the TENT that connect to the same node; this, combined with the sorting of the TENT and the separation of the TENT from the PATH, executes the shortest path rule.
With these points in mind, Figures 13-1 through 13-9 are used to illustrate the operation of Dijkstra’s SPF algorithm.
Figure 13-1 A Small Network for Demonstrating Dijkstra’s SPF Algorithm
Each of the following illustrations, along with the accompanying description, will show one step in the SPF algorithm on this network, beginning with Figure 13-2.
At the point illustrated in Figure 13-2, A has been moved from the TOPO into the TENT and then into the PATH. The cost of the origin node to itself is always 0; this link is included to start the SPF calculation. This represents lines 01 through 09 in the pseudocode shown earlier. Figure 13-3 illustrates the second step in the SPF calculation.
In Figure 13-3, each node connected to A has been moved from the TOPO to the TENT; this represents lines 10 through 17 in the pseudocode shown earlier. When this step began, there was only A in the TENT, so there are no existing nodes in the TENT that would have caused any metric comparisons. The TENT is now sorted, and execution continues with line 03 in the pseudocode. Figure 13-4 illustrates.
In Figure 13-4, one of the two shortest cost paths—to B and F, each with a cost of 1—has been chosen and moved to the PATH (lines 05–09 in the pseudocode shown earlier). When B is moved from the TENT to the PATH, any nodes with an origin of B in the TOPO are moved to the TENT (lines 10–17 in the pseudocode). Note that C was not already in the TENT before B’s move to the PATH drew it in, so no metric comparison is done. The cost to C is the sum of the cost of its predecessor in the PATH (which is B, with a cost of 1) and the link between the two nodes; hence C is added to the TENT with a cost of 2. The TENT is sorted (line 3 of the pseudocode), so the process is ready to begin again. Figure 13-5 illustrates the next step in the process.
In Figure 13-5, the shortest path on the TENT has been chosen, and F moved from the TENT to the PATH. There is a link between F and E (shown in previous illustrations as [E,F]), but the path through F to E is the same cost as the path [A,E], so this link is not added to the TENT. Rather, it remains grayed out, as it is not considered for inclusion in the SPT, and is removed from the TOPO. Figure 13-6 illustrates the next step in the process, which will move one of the metric 2 paths into the PATH.
Most real-world implementations support carrying multiple equal cost paths from the TENT into the PATH, so they can forward traffic across all links with the same metric. This is called equal cost multipath, or ECMP. There are a number of different ways to accomplish this, but they are not covered here.
In Figure 13-6, the path to C through B, with a cost of 2, has been moved to the PATH, and the path to D through [A,B,C,D] has been moved to the TENT. In moving this path to the TENT, however, line 11 in the pseudocode finds an existing path to D on the TENT, the [A,D] path, with a cost of 5. The metric through the new path, 3, is lower than the metric through the existing path, 5, so the [A,D] path is removed from the TENT when the [A,B,C,D] path is added (line 15 in the pseudocode). Figure 13-7 shows the next step, where the remaining cost 2 link is moved from the TENT to the PATH.
In Figure 13-7, the path to E, with a cost of 2, has been moved from the TENT to the PATH. G has been moved to the TENT with a cost of 4 (the sum of [A,E] and [E,G]). E’s other neighbor, F, is explored, but it is already on the PATH, so it is not considered for inclusion in the TENT. Figure 13-8 illustrates the next step, which moves D onto the PATH.
In Figure 13-8, D, with a total cost of 3, has been moved from the TENT to the PATH. This brings D’s neighbor, G—the last entry in TOPO—into consideration for the TENT. However, there is already a path to G with a total cost of 4 through [A,E,G], so line 14 in the pseudocode fails, and the path [D,G] is removed from the TOPO. This is the final SPT.
The primary difficulty in understanding Dijkstra’s algorithm is that the shortest path rule isn’t executed in one place (or on one router), as it is with Bellman-Ford or the Diffusing Update Algorithm (DUAL). The shortest path is (apparently) checked only when moving nodes from the TOPO to the TENT; in reality, the sorting of the TENT itself executes another portion of the shortest path rule, and checking against the PATH for existing nodes constitutes yet another, making the process three steps:
1. If the path to the node is longer than any on the TENT, then the one on the TENT is a shorter path across the entire network.
2. A path that has risen to the top of the TENT through sorting is the shortest to that node in the network.
3. If the path moves to the PATH from the top of the TENT, it is the shortest path to that node in the network, and any other entries in the TOPO to that node should be discarded.
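The walkthrough can be reproduced as a runnable sketch, with a priority queue standing in for the explicitly sorted TENT. The link costs are taken from the worked example; the [D,G] cost is not stated in the text, so 2 is assumed (any cost of 1 or more yields the same final costs):

```python
import heapq

# Link costs from the worked example; [D,G] = 2 is an assumption.
topo = {
    "A": {"B": 1, "F": 1, "E": 2, "D": 5},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"A": 5, "C": 1, "G": 2},
    "E": {"A": 2, "F": 1, "G": 2},
    "F": {"A": 1, "E": 1},
    "G": {"E": 2, "D": 2},
}

def spf(topo, me):
    tent = [(0, me, None)]  # the heap keeps the TENT sorted (line 03)
    path = {}               # node -> (cost, predecessor): the SPT
    while tent:
        cost, node, predecessor = heapq.heappop(tent)
        if node in path:    # selected is already in PATH: do nothing
            continue
        path[node] = (cost, predecessor)  # add selected to PATH (line 09)
        for neighbor, link_cost in topo[node].items():
            if neighbor not in path:
                heapq.heappush(tent, (cost + link_cost, neighbor, node))
    return path

path = spf(topo, "A")
```

Instead of replacing a worse TENT entry in place (line 15 of the pseudocode), this "lazy" variant leaves the stale entry on the heap and discards it when it is popped; the result is the same SPT as the walkthrough: B and F at cost 1, C and E at cost 2, D at 3 through C, and G at 4 through E.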
With the base algorithm in place, it is useful to look at some optimizations, and the calculation of Loop-Free Alternates (LFAs) and remote Loop-Free Alternates (rLFAs).
There is no particular reason that the entire SPT must be rebuilt each time there is a change to the network topology or reachability information; Figure 13-9 is used to explain.
Assume G loses its connection to 2001:db8:3e8:100::/64; device A does not need to recalculate its path to any of the nodes in the network. The reachable destination is just a leaf on the tree, even if it is a set of hosts connected to a single wire (such as an Ethernet). There is no reason to recalculate the entire SPT when a single leaf (or any set of leaves) is disconnected from the network. In this case, only the leaf (the Internet Protocol [IP] address or the reachable destination) itself would need to be removed from the network (or rather, the destination can be removed from the database without any change to the network). This is a partial recalculation of the SPT.
Assume the [C,E] link fails. What does A do in this case? Again, there is no change to the topology of C, B, and D, so there is no reason for A to recalculate the entire tree. It is possible, in this case, for A to remove the entire tree beyond E. To compute just the changed portion of the graph, do the following:
• Remove the failed node and all nodes that A passes through E to reach.
• Recalculate the tree just from C’s predecessor (in this case, A) to determine if there are alternate paths to reach nodes previously reachable through E before the [C,E] link failed.
This is called an incremental SPF.
Chapter 12, “Unicast Loop-Free Paths (1),” considered the theory behind LFAs and rLFAs. Bellman-Ford does not calculate either downstream neighbors or LFAs, and does not appear to have the information required to do so. DUAL calculates downstream neighbors by default and uses them during convergence. What about protocols based on Dijkstra (and, by extension, similar SPF algorithms)? Figure 13-10 illustrates a simple mechanism that these protocols can use to find LFAs and downstream neighbors.
Figure 13-10 Calculating LFAs and Downstream Neighbors with Dijkstra’s Algorithm
The definition of a downstream neighbor is one where the neighbor’s cost to reach a destination is less than the local cost to reach the destination. From A’s perspective:
• A knows the local cost to reach the destination, based on the SPT built by running Dijkstra’s SPF.
• A knows B’s and C’s cost to reach the destination, by subtracting the cost of the [A,B] and [A,C] links from the locally calculated cost.
Hence, A can compare the local cost with the cost from each neighbor to determine if any neighbor is downstream in relation to any particular destination. The definition of an LFA is
If the neighbor’s cost to “me” plus the neighbor’s cost to reach the destination is lower than the local cost, the neighbor is an LFA.
Or rather, given
• NC is the neighbor’s cost to the destination.
• BC is the neighbor’s cost to me.
• LC is the local cost to the destination.
If NC + BC < LC, then the neighbor is an LFA. In this case, A knows the cost of the [B,A] and [C,A] links from the perspective of the neighbor (it would be contained in the topology table, although it is not used in computing the SPT using Dijkstra’s algorithm). So LFAs and downstream neighbors require very little additional work to calculate, but what about remote LFAs? The P/Q Space model provides the simplest way for Dijkstra-based algorithms to compute remote LFAs. Figure 13-11 is used to illustrate from within the P/Q Space (see Chapter 12).
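Both conditions translate directly into code, using the NC, BC, and LC terms defined above; the function names here are illustrative, not taken from any protocol implementation.

```python
def is_downstream(nc, lc):
    """Downstream neighbor: the neighbor's cost to the destination
    (NC) is less than the local cost (LC)."""
    return nc < lc

def is_lfa(nc, bc, lc):
    """Loop-free alternate, per the rule above: the neighbor's cost
    to the destination (NC) plus the neighbor's cost back to me (BC)
    is less than the local cost (LC)."""
    return nc + bc < lc
```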
Figure 13-11 P/Q Space and Calculating Remote LFAs with Dijkstra’s Algorithm
The definition of the P space is the set of nodes reachable from one end of the protected link, and the definition of Q space is the set of nodes reachable without traversing the protected link. This should suggest a moderately simple way to calculate these two spaces using Dijkstra:
Calculate an SPT from the perspective of the device connected to one end of the link; remove the link without recalculating the SPT. The remaining nodes are reachable from this end of the link.
In Figure 13-11, E can
• Calculate the Q space by removing the [E,D] link from a copy of the local SPT, and all nodes that E uses D to reach.
• Calculate the P space by calculating an SPT from D’s perspective (using D as the root of the tree), removing the [D,E] link, and then all nodes that D uses E to reach.
• Find the closest node reachable from both E and D with the [E,D] link removed.
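These steps can be sketched as two SPF runs over the topology with the protected link removed, followed by a search for the closest node present in both results. This is a simplification (a full rLFA computation also discards nodes whose post-failure shortest path would still cross the protected link), and the topology format and function names are assumptions for illustration.

```python
import heapq

def spf_without(topo, root, removed_edge):
    """Shortest-path costs from root, with removed_edge deleted
    from the topology (in both directions)."""
    banned = {removed_edge, tuple(reversed(removed_edge))}
    dist, tent = {}, [(0, root)]
    while tent:
        cost, node = heapq.heappop(tent)
        if node in dist:
            continue
        dist[node] = cost
        for neighbor, link_cost in topo.get(node, {}).items():
            if (node, neighbor) not in banned and neighbor not in dist:
                heapq.heappush(tent, (cost + link_cost, neighbor))
    return dist

def pq_node(topo, e, d):
    """Closest node reachable from both ends of the protected
    [e,d] link once the link itself is removed."""
    from_e = spf_without(topo, e, (e, d))
    from_d = spf_without(topo, d, (e, d))
    common = (set(from_e) & set(from_d)) - {e, d}
    if not common:
        return None
    return min(common, key=lambda n: from_e[n] + from_d[n])
```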
Dijkstra’s SPF is a versatile, widely used algorithm for computing Shortest Path Trees through a network.
Path vector relies on keeping a list of the nodes through which a path passes. Any node that receives an update with itself in the path will just discard the update, as it is not a viable path. Figure 13-12 is used for an example.
In Figure 13-12, each device advertises information about destinations to each neighboring device; for the destination attached to E:
1. E will advertise F with itself in the source, so with a path of [E], to both B and D.
2. From B:
B will advertise F to A with a path of [E,B].
3. From D:
D will advertise F to C with a path of [E,D].
4. From C:
C will advertise F to A with a path of [E,D,C].
Note
Path vector was not developed as a theory or algorithm, but rather as a protocol; it is unique among the algorithms discussed here in this regard.
Which path will A prefer? In a path vector system, there can be a number of metrics, including the length of the path, policy preferences, etc. For instance, assume there is a metric that is set locally at each node carried with each route. This local metric is carried between nodes but not summed in any way as it passes through the network, and each node can set this metric independently of the other nodes (so long as the node uses the same metric toward every neighbor). For instance, E’s local metric is advertised to B, which then sets its own local metric for this destination and advertises the resulting route to A, etc.
To determine the best path, each node can then
• Discard any destination with the local node identifier in the path.
• Compare the metric, choosing the highest local metric among those it has received.
• Compare the length of the path, choosing the shortest path among those it has received.
• Advertise only the path being used to forward traffic.
Note
It does not matter if each node chooses the highest or the lowest metric; it only matters that each node does the same thing throughout the entire network. If comparing paths, however, the node must always choose the shorter path.
If every node in the network always follows these rules, no loop will form. For instance:
• E advertises F to B with a path of [E] and a metric of 100.
• B advertises F to A with a path of [E,B] and a metric of 100.
• E advertises F to D with a path of [E] and a metric of 100.
• D advertises F to C with a path of [E,D] and a metric of 100.
• C advertises F to A with a path of [E,D,C] and a metric of 100.
A has two paths, both with the same metric, and hence will use the second rule to choose one, which is the shorter path. In this case, A will choose the path through [E,B]. A will advertise the route it is using toward C, but if C is following the same set of rules, it will also have two paths with a metric of 100 available, one with the path [E,B,A], and the second with a path of [E,D,C]. In this case, there must be a tie breaker that C uses internally to choose between the two routes. It isn’t important what this tie breaker is, so long as it is consistently applied within the node; no matter which path C chooses, the traffic toward F will not loop.
Assume, however, a slightly different set of circumstances:
• E advertises F to B with a path of [E] and a metric of 100.
• B advertises F to A with a path of [E,B] and a metric of 100.
• E advertises F to D with a path of [E] and a metric of 50.
• D advertises F to C with a path of [E,D] and a metric of 50.
• C advertises F to A with a path of [E,D,C] and a metric of 50.
A has two paths, one with a metric of 100, and another with a metric of 50. Therefore:
• A will choose the higher of the two metrics, the path through [E,B], and advertise this route to C.
• C will choose the higher of the two metrics, the path through [E,B,A], and advertise this route to D.
• D will choose the higher of the two metrics, the path through [E,B,A,C], and advertise this route to E.
• E will discard this route, as E itself is already in the path.
Hence, even if the metric overrides the path length at (almost) every node, no loop will form.
Consider the problem of a medical procedure executed by a robot following the hands of a live surgeon halfway across the world. It is possible that making such a system work requires packets to be delivered from the sensors on the surgeon’s hands to the robot in near real time, in order, with little or no jitter, and absolutely no packets can be dropped. This example, of course, can be expanded to many different situations, including financial systems and other mechanical control systems where near-real-time packet delivery with no failures is required.
What is often needed in these situations is to transmit two copies of each packet and then allow the receiver to choose the packet best fitting the Quality of Service (QoS) and packet loss characteristics needed to support the application. All of the systems discussed so far, however, can find only one loop-free path, and potentially an alternate path (an LFA and/or an rLFA). The problem being solved, then, by disjoint path algorithms, is this:
How can paths be built through a network in such a way as to make certain they use as few overlapping resources (devices and links) as possible (hence are maximally disjoint, or maximally redundant)?
This section will begin by describing the concept of a two-connected network, and then consider two different (but seemingly related) ways of calculating disjoint topologies on two-connected networks.
A two-connected network is any network in which there are at least two paths between a source and destination that do not use the same devices (nodes) or links (edges). There are a few points to pay attention to here:
• A network is two-connected in relation to a specific set of sources and destinations; most networks are not two-connected for every source and every destination.
• Small blocks of any given network may be two-connected for some sources and destinations, and these blocks may be interconnected by narrow one- or two-connected choke points.
Note
Choke points will play a major role in many different areas of network design, a topic considered in Part III, “Network Design.”
It is often easiest to understand two-connectedness through an actual example; Figure 13-13 shows a network marked out in blocks.
In block A, there are at least two different disjoint paths between X and F:
• [X,A,B,E,F] and [X,C,F]
• [X,A,B,F] and [X,C,F]
In block B, there is one pair of disjoint paths from G to L: [G,K,L] and [G,H,L]. There are no disjoint paths to Z, as this node is singly connected. There are also no disjoint paths between F and G, as these two are singly connected. The [F,G] link can be considered a choke point between these two topology blocks. It is not possible, in the network illustrated in Figure 13-13, to compute two disjoint paths between X and Z.
In 1974, J. W. Suurballe published a paper describing how to use multiple runs of Dijkstra’s SPF algorithm to find multiple disjoint topologies in a network.2 The algorithm essentially computes SPF once, removes a subset of the links in use on the SPT, and then computes a second SPF across the remaining links. Suurballe’s algorithm is harder to explain than to illustrate in an example because of its reliance on the directional nature of the links computed through SPT; Figure 13-14 through Figure 13-18 are used as examples.
Figure 13-14 Using Suurballe’s Algorithm for Finding Disjoint Paths, Step 1
Figure 13-15 Using Suurballe’s Algorithm for Finding Disjoint Paths, Step 2
Figure 13-16 Using Suurballe’s Algorithm for Finding Disjoint Paths, Step 3
Figure 13-17 Using Suurballe’s Algorithm for Finding Disjoint Paths, Step 4
Figure 13-14 shows the state of the operations after the first SPF run has completed and the initial SPT is computed. Note the directional arrows on the links; it is not common to think about an SPT as being directional, but in reality it is, with each link oriented away from the source, or the root of the tree. When F computes a tree back toward X, it would also produce a directional tree with the arrows pointing in the opposite direction.
Edges (or links) on the SPT are called tree edges, and edges (or links) not on the resulting SPT are called nontree edges. In Figure 13-14, the tree edges are marked in solid black with directional arrows, and the nontree edges are lighter gray dashed lines.
The second step is shown in Figure 13-15.
Figure 13-15 shows each link with modified costs; each link that was a part of the original SPT (each tree edge, shown as a solid line) has two costs, one in each direction, while links not originally part of the SPT (nontree edges, shown as dashed lines) have their original costs. Note the arrows showing the direction of the cost in each case; this will be important in the next stage of the calculation. To calculate the costs of the two directional links for each tree edge:
1. Call one end of the link u and the other end of the link v; note the equation is being run in both directions.
2. Subtract the cost from the source to v from the cost of the link from u to v.
3. Add the cost from the source to u.
If the source is s:
d[sp](u,v) = d(u,v) − d(s,v) + d(s,u)
This essentially sets the cost of tree edges to 0, as can be seen by doing the math for the [B,E] link:
• B is u, E is v, A is s
• d(u,v) = 2, d(s,v) = 3, d(s,u) = 1
• 2 − 3 + 1 = 0
All of the nontree edges, however, will be set to some (generally larger) nonzero cost. For the network in Figure 13-15:
• For the [B,A] link (note [A,B] is not a link in the directional tree being calculated):
B is u, A is v, A is s
d(u,v) = 0, d(s,v) = 0, d(s,u) = 1
0 − 0 + 1 = 1
• For the [E,B] link:
E is u, B is v, A is s
d(u,v) = 2, d(s,v) = 1, d(s,u) = 3
2 − 1 + 3 = 4
• For the [C,A] link:
C is u, A is v, A is s
d(u,v) = 2, d(s,v) = 0, d(s,u) = 2
2 − 0 + 2 = 4
• For the [F,D] link:
F is u, D is v, A is s
d(u,v) = 1, d(s,v) = 4, d(s,u) = 5
1 − 4 + 5 = 2
• For the [D,B] link:
D is u, B is v, A is s
d(u,v) = 1, d(s,v) = 1, d(s,u) = 2
1 − 1 + 2 = 2
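Each of these calculations applies the same formula to the per-edge numbers given above, which makes the transform easy to verify in code:

```python
def suurballe_cost(d_uv, d_sv, d_su):
    """Transformed cost d'(u,v) = d(u,v) - d(s,v) + d(s,u).
    Tree edges come out to zero; nontree edges keep a
    nonnegative transformed cost."""
    return d_uv - d_sv + d_su
```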
The next step, shown in Figure 13-16, is to remove all the directional edges that point back toward the source and lie along the original SPT toward the specific destination (Z, in this case), reverse the direction of the zero-cost edges (links) along this same path, and then run Dijkstra’s SPF again, creating a second SPT on the same topology.
Returning to the original SPT, the path from X to Z was along the path [A,B,D,F]. Hence, the four nonzero-cost edges (the dashed lines) pointing back toward the source, A, along this path have been removed. Along the same path, [A,B,D,F], the direction of each edge has been reversed; for instance, [A,B] originally pointed from A toward B and now points from B toward A. The next step is to run SPF across this graph, remembering traffic cannot flow against the direction of the link. The resulting tree is shown in Figure 13-17.
Figure 13-17 shows the original tree and the newly calculated tree overlaid on the original topology as two different dashed lines. The two topologies still share the [B,D] link in common, so they are not truly disjoint yet. At this point, there are two shortest paths from X to Z:
• [A,B,D,F]
• [A,C,D,B,E,F]
These two graphs are merged to form a set of edges, and any links that are included in both graphs, but in opposite directions, are discarded; the combined set looks like this:
[A->B, B->E, E->F, A->C, C->D, D->F]
Note the directionality of each link again; it is crucial to prune out the overlapping link, which would be listed both as [B->D] and [D->B]. With this subset of possible edges on the graph, it is possible to see that the correct set of shortest paths is [A,B,E,F] and [A,C,D,F].
Suurballe’s algorithm is complex, but shows the principal points of calculating disjoint trees—including how difficult they are to compute.
A simpler alternative to Suurballe’s algorithm to calculate disjoint trees is computing Maximally Redundant Trees (MRTs). The best place to begin in understanding MRTs is with the humble Depth First Search (DFS), particularly the numbered DFS. Figure 13-18 is used as an illustration.
In Figure 13-18, the left side represents a simple topology; the right, the same topology that has been numbered using a DFS. Assuming the DFS algorithm used to “walk” the tree always chooses the left node over the right, the process would look something like this:
01 main {
02 dfs_number = 1
03 root.number = dfs_number
04 recurse_dfs(root)
05 }
06 recurse_dfs(current) {
07 for each neighbor of current {
08 child = left most neighbor (not visited)
09 if child.number == 0 {
10 dfs_number++
11 child.number = dfs_number
12 if child.children > 0 {
13 recurse_dfs(child)
14 }
15 }
16 }
17 }
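The pseudocode above can be written as runnable Python. The adjacency list is an assumption reconstructed from Figure 13-18 (neighbors listed left to right), so the numbering logic, not the topology, is the point of the sketch.

```python
def dfs_number(topo, root):
    """Assign DFS numbers, always visiting the leftmost
    unvisited neighbor first (neighbors are listed left
    to right in the adjacency lists)."""
    numbers = {root: 1}
    def recurse(current):
        for child in topo.get(current, []):
            if child not in numbers:           # the "line 09" check
                numbers[child] = len(numbers) + 1
                recurse(child)
    recurse(root)
    return numbers

# Assumed adjacency for Figure 13-18, leftmost neighbor first.
figure_13_18 = {'A': ['B', 'C', 'D'], 'B': ['E', 'F'],
                'C': ['G'], 'G': ['D']}
```

On this assumed topology the walkthrough's numbering falls out directly: B is 2, E is 3, F is 4, and the path [A,C,G,D] carries always-increasing numbers.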
The best way to understand this code is to walk through the recursion a few times to see how it works. Using Figure 13-18:
• In the first call into recurse_dfs, A, or root, is set as the current node.
• Once inside recurse_dfs, the leftmost node of A is chosen, or B.
• B does not have a number when the loop is entered, so the if statement on line 09 is true.
• B is assigned the next DFS number (line 11).
• B has children (line 12), so recurse_dfs is called again with B as the current node.
• Once inside the (second level of) recurse_dfs, the leftmost neighbor of B is chosen, which is E.
• E does not have a DFS number, so the if statement on line 09 is true.
• E is assigned the next DFS number (3).
• E does not have children, so the processing winds back to the top of the loop.
• F is now the leftmost neighbor of B that has not been visited, so it is assigned to child.
• F does not have a number, so the if statement on line 09 is true.
• F is assigned the next DFS number (4).
• B has no more children, so the for loop at line 07 fails, and the recurse_dfs exits.
• However, recurse_dfs does not actually exit—it just “falls back” to the previous recursion level, which is line 14; this level of recursion is still processing A’s neighbors.
• C is the next neighbor of A that has not been touched, so child is set to C.
• And so on.
Examining the numbers of the nodes on the right side of Figure 13-18 leads to the following interesting observations:
• If A always follows an increasing number to reach D, it will follow the path [A,C,G,D].
• If D always follows a decreasing DFS number to reach A, it will follow the path [D,A].
• These two paths are, in fact, disjoint.
This property holds for all topologies that have been assigned numbers through a DFS search: a path that follows always-increasing numbers will always be disjoint with a path that always follows decreasing numbers. This is precisely the property MRTs rely on to build disjoint paths. The problem with DFS numbering, however, is it is difficult to do in near real time. There must be some sort of elected root, traffic is suboptimal at a local level (much like a Minimum Spanning Tree, or MST, might be), and any changes to the topology require the entire DFS numbering scheme to be rebuilt.
To work around these problems, MRT builds disjoint topologies using the same principle but in a different way. Figure 13-19 is used to explain.
The first step in building an MRT is to find a short loop through the topology from a root (generally these loops are found using Dijkstra’s SPF algorithm). In this case, A will be chosen as the root, and the loop will be [A,B,C,D]. This first loop will be used as the first of the two topologies, say the red topology. Reversing the loop to [A,D,C,B] generates a disjoint topology, say the blue topology. This first pair of topologies through this short loop is called an ear.
To expand the range of the MRT, a second ear is added to the first. To do this, a second loop is discovered, this time through [A,D,F,E,B], and the disjoint topology is [A,B,E,F,D]. The question is: which of these two topology extensions should be added to the red topology, and which should be added to the blue? This is where a form of DFS numbering comes into play.
Each device in the network must already have an identifier assigned, either by the administrator, or through some other mechanism. These identifiers must be unique per device. Within the DFS numbering scheme there is also the concept of a low point, which indicates where on a particular tree this node attaches, and also what nodes attach to the tree through this node.
Given these unique identifiers and the ability to calculate a low point, each node in the network can be ordered just as if it were given a number through a DFS numbering process. The key is to know how the ordering corresponds to the existing red and blue topologies. Assume, if the [A,B,C,D] topology is part of the red topology, that B’s low point is higher than C’s. For any other ear or loop in the topology that passes through B and C, the direction in which B is less than C should be placed on the red topology; the loop in the opposite direction should be placed on the blue topology.
This explanation is rather cursory, but it does give you a sense of how MRTs form disjoint topologies. Refer to the “Further Reading” section at the end of this chapter for more information on MRTs and their construction.
This chapter and the preceding one have described a number of different ways to compute a loop-free path (or a set of disjoint paths) through a network. In each of these cases, the path computed is unidirectional—from the root of the tree to the edges, or reachable destinations. It is, in fact, possible for no return path to exist. In other words, a source may be able to reach a destination along a loop-free path, but there may be no return path from the destination to the source. This can be an uncommon failure mode in some link types, a result of filtering reachability information, or a number of other situations in the network.
Note
Two-way connectivity is not always desired; consider the case of a submarine, for instance, that needs to receive information about its current mission but cannot transmit any information without revealing its current position. The ability to send packets to devices located on the submarine, even though there is no two-way connectivity to them, would be desirable. Control planes either must be modified or specially designed to handle this kind of uncommon case, as the common case is for two-way connectivity to be required for proper network operation.
One other problem control planes must contend with in the area of computing paths is ensuring end-to-end two-way connectivity exists.
There are a number of ways a control plane can solve this problem:
• Some control planes just ignore this problem, which means they assume some other protocol, such as a transport protocol, will detect this condition.
• The control plane can check for this problem during route calculation. It is possible, for instance, when calculating routes using Dijkstra’s algorithm, to perform a back link check while computing loop-free paths. Performing this back link check at each step of the computation can ensure two-way connectivity exists.
• The control plane can assume two-way connectivity between neighbors ensures end-to-end two-way connectivity. Control planes that perform explicit two-way connectivity checks on a per neighbor basis can (generally) safely assume any path through those neighbors is also capable of two-way communications.
These two chapters have covered a lot of ground, beginning with the shortest path rule and its importance in the process of computing loop-free paths through a network. Bellman-Ford, in its original form, was discussed next, then Garcia’s DUAL. Routing protocols built on these two protocols are considered distance-vector protocols, a term you will encounter in following chapters. Dijkstra’s SPF was considered next; protocols built on this algorithm are considered link state. Then the path-vector solution was discussed, and finally disjoint paths.
Most of these algorithms can be used either by a distributed control plane or a centralized one. The primary point is to know how the loop-free path problem can be solved, so you can recognize it in its many forms and understand how it is being solved, no matter what protocol or controller you are looking at.
Chandra, Ravi, and John Scudder. Capabilities Advertisement with BGP-4. Request for Comments 5492. RFC Editor, 2009. doi:10.17487/rfc5492.
Chen, Enke, Tony J. Bates, and Ravi Chandra. BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP). Request for Comments 4456. RFC Editor, 2006. doi:10.17487/rfc4456.
Chen, Enke, John Scudder, Alvaro Retana, and Daniel Walton. Advertisement of Multiple Paths in BGP. Request for Comments 7911. RFC Editor, 2016. doi:10.17487/rfc7911.
Chen, Enke, and Quaizar Vohra. BGP Support for Four-Octet AS Number Space. Request for Comments 4893. RFC Editor, 2007. doi:10.17487/rfc4893.
Chunduri, Uma, Wenhu Lu, Albert Tian, and Naiming Shen. IS-IS Extended Sequence Number TLV. Request for Comments 7602. RFC Editor, 2015. doi:10.17487/rfc7602.
Dijkstra, E. W. “A Note on Two Problems in Connexion with Graphs.” Numerische Mathematik 1, no. 1 (1959): 269–71. doi:10.1007/BF01386390.
Doyle, Jeff, and Jennifer DeHaven Carroll. Routing TCP/IP, Volume 1. 2nd edition. New Delhi, India: Cisco Press, 2005.
Ferguson, Dennis, Acee Lindem, and John Moy. OSPF for IPv6. Request for Comments 5340. RFC Editor, 2008. doi:10.17487/rfc5340.
Ginsberg, Les, Stephane Litkowski, and Stefano Previdi. IS-IS Route Preference for Extended IP and IPv6 Reachability. Request for Comments 7775. RFC Editor, 2016. doi:10.17487/rfc7775.
Heitz, Jakob, Keyur Patel, Job Snijders, Ignas Bagdonas, and Nick Hilliard. “BGP Large Communities.” Internet-Draft. Internet Engineering Task Force, January 2017. https://tools.ietf.org/html/draft-ietf-idr-large-community-12.
“Intermediate System to Intermediate System Intra-Domain Routing Information Exchange Protocol for Use in Conjunction with the Protocol for Providing the Connectionless-Mode Network Service.” Standard. Geneva: International Organization for Standardization, 2002. http://standards.iso.org/ittf/PubliclyAvailableStandards/.
Katz, Dave. “OSPF and IS-IS: A Comparative Anatomy.” Presented at NANOG19, Albuquerque, NM, June 12, 2000. https://nanog.org/meetings/abstract?id=1084.
McPherson, Danny R., and Keyur Patel. Experience with the BGP-4 Protocol. Request for Comments 4277. RFC Editor, 2006. doi:10.17487/rfc4277.
Meyer, David, and Keyur Patel. BGP-4 Protocol Analysis. Request for Comments 4274. RFC Editor, 2006. doi:10.17487/rfc4274.
Mirtorabi, Sina, Abhay Roy, Acee Lindem, and Fred Baker. “OSPFv3 LSA Extendibility.” Internet-Draft. Internet Engineering Task Force, October 2016. https://tools.ietf.org/html/draft-ietf-ospf-ospfv3-lsa-extend-13.
Moy, John. “OSPF Version 2.” Request for Comments 2328. RFC Editor, April 1998. doi:10.17487/RFC2328.
Parker, Jeff. Recommendations for Interoperable Networks Using Intermediate System to Intermediate System (IS-IS). Request for Comments 3719. RFC Editor, 2004. doi:10.17487/rfc3719.
Przygienda, Dr. Antoni B. Optional Checksums in Intermediate System to Intermediate System (ISIS). Request for Comments 3358. RFC Editor, 2002. doi:10.17487/rfc3358.
Ramachandra, Srihari S., and Yakov Rekhter. BGP Extended Communities Attribute. Request for Comments 4360. RFC Editor, 2006. doi:10.17487/rfc4360.
Raszuk, Robert, Christian Cassar, Bruno Decraene, Stephane Litkowski, Kevin Wang, and Erik Aman. “BGP Optimal Route Reflection (BGP-ORR).” Internet-Draft. Internet Engineering Task Force, January 2017. https://tools.ietf.org/html/draft-ietf-idr-bgp-optimal-route-reflection-13.
Rekhter, Yakov, Susan Hares, and Tony Li. A Border Gateway Protocol 4 (BGP-4). Request for Comments 4271. RFC Editor, 2006. doi:10.17487/rfc4271.
Retana, Alvaro, and Russ White. “BGP Custom Decision Process.” Internet-Draft. Internet Engineering Task Force, February 2017. https://tools.ietf.org/html/draft-ietf-idr-custom-decision-08.
Roy, Abhay, Yi Yang, and Alvaro Retana. Hiding Transit-Only Networks in OSPF. Request for Comments 6860. RFC Editor, 2013. doi:10.17487/rfc6860.
Shand, Mike, Stefano Previdi, Les Ginsberg, and Danny R. McPherson. Simplified Extension of Link State PDU (LSP) Space for IS-IS. Request for Comments 5311. RFC Editor, 2009. doi:10.17487/rfc5311.
Suurballe, J. W. “Disjoint Paths in a Network.” Networks 4, no. 2 (1974): 125–45. doi:10.1002/net.3230040204.
Vohra, Quaizar, and Enke Chen. BGP Support for Four-Octet Autonomous System (AS) Number Space. Request for Comments 6793. RFC Editor, 2012. doi:10.17487/rfc6793.
Walton, Daniel, Alvaro Retana, Enke Chen, and John Scudder. Solutions for BGP Persistent Route Oscillation. Request for Comments 7964. RFC Editor, 2016. doi:10.17487/rfc7964.
Wang, Lili, Zhaohui (Jeffrey) Zhang, and Nischal Sheth. OSPF Hybrid Broadcast and Point-to-Multipoint Interface Type. Request for Comments 6845. RFC Editor, 2013. doi:10.17487/rfc6845.
“What Are the Differences between NP, NP-Complete and NP-Hard?” Stack Overflow. Accessed September 24, 2017. https://stackoverflow.com/questions/1857244/what-are-the-differences-between-np-np-complete-and-np-hard.
White, Russ. Intermediate System to Intermediate System (IS-IS) Routing Protocol LiveLessons. Video. LiveLessons. Cisco Press, 2016. http://www.ciscopress.com/store/intermediate-system-to-intermediate-system-is-is-routing-9780134465326?link=text&cmpid=2017_02_02_CP_RussWhiteVideo.
White, Russ. “iSPF Versus PRC.” Rule 11 Reader, June 7, 2017. https://rule11.tech/ispf-verse-prc/.
White, Russ, Danny McPherson, and Srihari Sangli. Practical BGP. Boston, MA: Addison-Wesley Professional, 2004.
White, Russ, and Alvaro Retana. IS-IS: Deployment in IP Networks. 1st edition. Boston, MA: Addison-Wesley, 2003.
1. Read through the additional material on DFS numbering systems. The concept of the low point was left out of the main text for brevity. Can you expand on this concept and the importance of finding the low point in determining disjoint paths through the network?
2. Compare the operation of Bellman-Ford and Dijkstra in a network with negative cost links; draw a small network of six or seven routers, set one of the links so it has a negative cost in both directions, and determine the set of loop-free paths through the network using both algorithms. You do not need to run the algorithm to do this in a formal way; just describe which paths the Dijkstra algorithm will have a problem with and how the Bellman-Ford algorithm will react to these same paths.
3. Run MRT across the network shown in Figure 13-19 and show the resulting disjoint topologies.
4. Run Dijkstra’s SPF across the network shown in Figure 13-19 from the perspective of A, using a cost of 1 for each link. Show the resulting shortest paths.
5. Using the network shown in Figure 13-19, assuming all link costs are 1, what would be the best path toward G? Would D have an LFA or rLFA in this network?
6. Is there any way you can think of to ensure connectivity exists with a unidirectional connectivity problem, such as the submarine example given in the note?
1. Dijkstra, “A Note on Two Problems in Connexion with Graphs.”
2. Suurballe, “Disjoint Paths in a Network.”
You might have noticed that very few of the mechanisms described in Chapter 12, “Unicast Loop-Free Paths (1),” and Chapter 13, “Unicast Loop-Free Paths (2),” considered changes in the topology. Most of these solutions are focused on computing loop-free paths through an apparently stable network, as discovered by the mechanisms described in Chapter 11, “Topology Discovery.” But what happens when the topology changes? Returning to the introduction of Part II:
How do network devices build the tables needed to forward packets along loop-free paths through the network?
How do network devices build the tables needed to forward packets along loop-free paths through the network?
Now it is time to consider one more of the subproblems of this overarching problem:
How do control planes detect and react to changes in the network?
This question will be answered by examining two components of the convergence process in a control plane. The convergence process in a network can be described in four stages. Figure 14-1 is used for reference in describing these four stages.
Once the [C,E] link fails, the four stages that must occur are detection, distribution, computation, and installation.
1. Detecting the change: Whether the inclusion of a new device or link, or the removal of a device or link, regardless of the reason, the change must be detected by any connected devices. In Figure 14-1, devices C and E must detect the failure of the [C,E] link; when the link is brought back up, they must also detect the inclusion of this (apparently new) link in the topology.
2. Distributing information about the change: Each device participating in the control plane must learn about the topology change in some way. In Figure 14-1, devices A, B, and D must somehow be notified of the failure of the [C,E] link; when the link is brought back up, they must again be notified of the inclusion of this (apparently new) link in the topology.
3. Computing a new loop-free path to the destination: These algorithms are discussed in Chapters 12 and 13. In Figure 14-1, B and C must compute some alternate path to reach destinations behind E (or perhaps E itself).
4. Installing the new forwarding information into the relevant local tables: In Figure 14-1, B and C must install the newly computed loop-free paths to destinations beyond E into their local forwarding tables, so traffic can be switched along the new path.
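The four stages can be sketched as a single reaction routine. Everything in the sketch below is hypothetical scaffolding (the topology model, `compute_spf`, and `fib` are stand-ins, not any particular protocol's interfaces); it is intended only to show how the stages fit together once a change has been detected.

```python
def on_link_failure(failed_link, local_topology, neighbors, compute_spf, fib):
    """Illustrative reaction to a detected link failure, following the
    four stages described above. Stage 1 (detection) has already
    happened by the time this runs: failed_link is the detected change.
    """
    # Stage 2: distribution. Remove the link locally and tell every
    # control plane peer about the change.
    local_topology.discard(failed_link)
    for neighbor in neighbors:
        neighbor.notify(failed_link)

    # Stage 3: computation. Rerun the loop-free path calculation
    # (e.g., SPF) against the updated topology.
    new_routes = compute_spf(local_topology)

    # Stage 4: installation. Push the recomputed paths into the local
    # forwarding table so traffic switches onto the new paths.
    for destination, next_hop in new_routes.items():
        fib[destination] = next_hop
```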
The following sections will focus on the first two of the four steps described in the preceding list, beginning with some thoughts on detecting topology changes. Some examples of protocols specializing in detecting topology changes will be considered. The distribution of topology and reachability information will take up the final half of this chapter. As this problem is, essentially, a distributed database problem, it will be addressed from that perspective.
The first step in reacting to a change in the network topology is to detect the change. Returning to Figure 14-1, how should the two devices connected to the link, C and E, detect the link has failed? The solution to this problem is not as simple as it might first appear for two reasons: information overload and false positives.
Information overload occurs when the control plane receives so much information it simply cannot distribute information about topology changes, and/or compute and install alternate paths into the relevant tables at each device, fast enough to keep the state of the network consistent. In the case of quick, persistently occurring changes, such as a link disconnecting and connecting every few milliseconds, the control plane can be overwhelmed with information, causing the control plane itself to consume enough network resources to cause the network to fail. It is also possible for a series of failures to trigger a positive feedback loop, in which case the control plane “folds in” on itself, either reacting very slowly or failing altogether. The solution to information overload is to hide the true state of the topology from the control plane until the rate of change is within the bounds the control plane can support.
False positives are the second sort of problem; if a link drops one packet out of every 100, and the single packet dropped each time just happens to be a control plane packet used to monitor the link’s state, the link will appear to go down and come back up (flap) quite frequently—even though other traffic is being forwarded across the link without problem.
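One common way to hide a flapping link from the control plane is to dampen it: each flap adds a penalty, the penalty decays over time, and while the penalty sits above a threshold the state changes are suppressed. The sketch below is illustrative only; the class name, thresholds, and half-life are invented for this example, though the approach loosely mirrors BGP-style route flap damping.

```python
class FlapDamper:
    """Minimal sketch of hiding a flapping link from the control plane.

    Each flap adds a penalty; the penalty decays exponentially over
    time. While the penalty is above the suppress threshold, state
    changes are hidden from the control plane.
    """

    def __init__(self, penalty_per_flap=1000, suppress_at=2000,
                 reuse_at=800, half_life=15.0):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life = half_life

    def _decay(self, now):
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def report_flap(self, now):
        """Record a state change; return True if the control plane
        should be told about it, False if it should stay hidden."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True
        return not self.suppressed

    def is_suppressed(self, now):
        """Reevaluate suppression as the penalty decays."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed
```

A link that flaps twice in quick succession is hidden; once it has stayed quiet long enough for the penalty to decay below the reuse threshold, its state changes are reported again.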
There are two broad classes of solutions to the event detection problem:
• Implementations can send packets periodically to determine the state of a link, device, or system. This is polling.
• Implementations can trigger a reaction to a change in the state of a link or device off some physical or logical state within the system. This is event driven.
There are, as always, different tradeoffs with these two solutions, and subcategories of each one.
Polling can be performed remotely (out of band) or locally (in band); Figure 14-2 illustrates both approaches.
In Figure 14-2, A and B are sending a hello, or some other form of polling packet, periodically across the same link through which they are connected, and across which they are forwarding traffic. This is in band polling, which has the advantage of tracking the state of the link over which traffic is being forwarded, reachability information is being carried, and so on. On the other hand, D is polling A and B for information about the state of the [A,B] link from another location in the network. For instance, D could check the state of the two interfaces on the [A,B] link on a periodic basis, or perhaps send a packet along the [D,A,B,D] path periodically. The advantage here is that information about the state of a large number of links can be centralized, making network management and troubleshooting easier. Both kinds of polling are often used in real-world network deployments.
Polling mechanisms often use two separate timers to operate:
• A timer to determine how often the poll is transmitted; this is often called the polling interval in the case of out of band polling, and is often called the hello timer in the case of in band polling
• A timer to determine how long to wait before declaring a link or device down, or to raise some sort of alarm; this is often called a dead interval or dead timer in the case of in band polling
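A minimal sketch of how these two timers interact might look like the following. The class name and interval values are illustrative assumptions, not drawn from any specific protocol, though real protocols commonly set the dead interval to a small multiple of the hello timer.

```python
class HelloPoller:
    """Sketch of in band polling driven by a hello timer and a dead
    interval. Values here are illustrative only."""

    def __init__(self, hello_interval=10.0, dead_interval=40.0):
        self.hello_interval = hello_interval
        self.dead_interval = dead_interval
        self.last_hello_sent = 0.0
        self.last_hello_heard = 0.0
        self.neighbor_up = False

    def tick(self, now, send_hello):
        # Transmit our own hello once per hello interval.
        if now - self.last_hello_sent >= self.hello_interval:
            send_hello()
            self.last_hello_sent = now
        # Declare the neighbor down once the dead interval expires
        # without hearing a hello from it.
        if self.neighbor_up and now - self.last_hello_heard >= self.dead_interval:
            self.neighbor_up = False

    def hello_received(self, now):
        # Any hello from the neighbor resets the dead timer.
        self.last_hello_heard = now
        self.neighbor_up = True
```

The key tradeoff is visible in the two timers: a shorter dead interval detects failures faster, but makes a single lost hello more likely to produce a false positive.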
The objective of in and out of band polling is often different. Out of band polling to discover changes in network state is often (but not always—specifically in the case of a centralized control plane) used to monitor the network state, and allows for centralized reactions to changes in state. In band polling is most often used (as you might expect) to detect changes in state locally, to drive the reaction of distributed control planes.
Event-driven failure detection relies on some local, measurable event to determine the status of a particular link or device. Figure 14-3 illustrates.
Note
This is just an example; not all router implementations follow this model.
In Figure 14-3, which shows one possible implementation of the architecture elements between the physical interface and the routing protocol, there are four steps:
1. The link between the two physical interface (phy) chips located at either end of the link fails. Physical interface chips are normally optical to electrical hand-offs. Most physical interface chips also perform some level of decoding on the inbound information, converting the individual bits on the wire to packets (deserialization), and packets into bits (serialization). Information is encoded by the physical interface onto a carrier, which is supplied by the two physical chips connected to the physical media. If the link fails, or one of the two interfaces is disconnected for any reason, the physical interface chip on the other end of the link will see the carrier drop in near real time—usually based on the speed of light and the length of the physical media. This condition is called loss of carrier.
2. The physical interface chip will, on detecting loss of carrier, send a notification toward the routing table (RIB) on the local device. This notification normally starts life as an interrupt, which is then translated into some form of Application Programming Interface (API) call into the RIB code, which results in the routes reachable through the interface, and any next hop information through the interface, being marked stale or being removed from the routing table. This signal may, or may not, pass through the Forwarding Information Base (FIB) along the way, depending on the implementation.
3. The RIB will notify the routing protocol about the routes it just removed from the local table based on the interface down event.
4. The routing protocol can then remove any neighbors reachable through the indicated interfaces (or rather through the connected routes).
At no point in Figure 14-3 is there a periodic process checking the state of anything, nor are there any packets moving across the wire. The entire process is based on the physical interface chip losing carrier on the connected media; hence this process is event driven.
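The event-driven chain in the four steps above might be sketched as a pair of cooperating objects. All class and method names here are hypothetical, and real implementations differ considerably; the point is only that each stage reacts to a notification from the one below it, with no polling anywhere.

```python
class Rib:
    """Sketch of the RIB's role in the event-driven chain."""

    def __init__(self):
        self.routes = {}        # prefix -> outgoing interface
        self.listeners = []     # routing protocols to notify

    def on_loss_of_carrier(self, interface):
        # Step 2: the phy chip's interrupt lands here; remove every
        # route that points out the failed interface.
        removed = [p for p, i in self.routes.items() if i == interface]
        for prefix in removed:
            del self.routes[prefix]
        # Step 3: notify the routing protocols of the removed routes.
        for protocol in self.listeners:
            protocol.routes_removed(removed)


class Protocol:
    """Sketch of a routing protocol registered with the RIB."""

    def __init__(self, rib):
        self.neighbors = {}     # neighbor -> connected route (prefix)
        rib.listeners.append(self)

    def routes_removed(self, prefixes):
        # Step 4: drop any neighbor reachable via a removed
        # connected route.
        self.neighbors = {n: p for n, p in self.neighbors.items()
                          if p not in prefixes}
```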
It is often the case that event-driven and polled status are combined. For instance, in Figure 14-3, if there were a management station polling the status of the interface in the local RIB on a periodic basis, the process from the physical interface chipset to the RIB would be event driven, while the process from the RIB to the management station would be driven by polling.
Table 14-1 summarizes the advantages and disadvantages of each event detection mechanism.
Table 14-1 Comparison of Polling and Event-Driven Detection
Status Distribution
• Out of Band Polling: Status is driven from a centralized system; the centralized system has a bigger picture view of the overall network state. Link and/or device state can be falsely reported; does not directly test forwarding capability.
• In Band Polling: Status is driven by local devices; gathering a bigger picture view of the state of the entire network requires gathering information from each individual network device. Link and/or device state can be directly tied to forwarding capability (barring failures within the state checking mechanism).
• Event Driven: Status is driven by local devices; gathering a bigger picture view of the state of the entire network requires gathering information from each individual network device. Link and/or device state can be directly tied to forwarding capability (barring failures within the state checking mechanism).
Speed of Detection
• Out of Band Polling: Must have some waiting interval before declaring a link or device failed to prevent false positives; slows reporting of network changes.
• In Band Polling: Must have some waiting interval before declaring a link or device failed to prevent false positives; slows reporting of network changes.
• Event Driven: Some timer before reporting failures might be desirable to reduce the reporting of false positives, but this timer can be very short, and backed with a double-check of the state of the system itself; generally much faster at reporting network changes.
Scaling
• Out of Band Polling: Must transmit periodic polls, consuming bandwidth, memory, and processing cycles; scales within these limits.
• In Band Polling: Must transmit periodic polls, consuming bandwidth, memory, and processing cycles; scales within these limits.
• Event Driven: Small amounts of current local state; tends to scale better than polling mechanisms.
While it may appear event-driven detection should always be favored, there are some specific situations where polling can solve problems that event-driven mechanisms cannot. For instance, one of the main advantages of polling-based systems, particularly when deployed in band, is to “see” the state of otherwise invisible boxes. For instance, in Figure 14-4, there are two routers connected through a third device, identified as a repeater in the illustration.
In Figure 14-4, device B is a simple physical repeater; whatever it receives on the [A,B] link it retransmits, just as it received it, on the [B,C] link. There is no control plane of any sort running on this device (at least not that A and C are aware of). Neither A nor C can detect this device, as it does not change the signal in any way A or C could measure.
What happens if the [A,B] link fails while A and C are using an event-driven mechanism to determine link state? A will lose carrier, of course, because the physical interface at B will no longer be reachable. However, C will continue to receive carrier and hence will not detect the link failure at all. If it is possible for A and C to somehow communicate with B, this situation can be resolved. For instance, if B tracks all the Address Resolution Protocol (ARP) requests it receives, it can, when the [A,B] link fails, somehow send an “inverse ARP” notifying C that A is no longer reachable. The other solution available in this situation is some sort of polling between A and C that verifies reachability across the entire link, including the state of B (even though A and C are not aware that B exists).
From a complexity perspective, event-driven detection increases the interaction surfaces between the systems in a network, while polling tends to keep state within a system. In Figure 14-3, there must be some sort of interface between the physical interface chipset, the RIB, and the routing protocol implementation. Each of these interfaces represents a place where information that might be better hidden through an abstraction is transferred between systems, and an interface that must be maintained and managed. Polling, on the other hand, can often be contained within a single system, completely ignoring the underlying mechanisms and technologies in place.
It will be useful, at this point, to spend a few pages examining an example of a protocol designed specifically to detect link state in a network. Neither of these protocols is part of a larger system (such as a routing protocol), but rather interact with other protocols through programming interfaces and status indicators.
Bidirectional Forwarding Detection (BFD) is grounded in a single observation: there are many control planes running on a typical network device, each with its own failure detection mechanism. It would be more efficient to run a single shared detection mechanism among all the different control planes. In most applications, BFD does not replace existing hello protocols used in each control plane, but rather augments them. Figure 14-5 illustrates.
In the BFD model, there are likely to be at least two different polling processes running over the same logical link (there could be more, if there are logical links layered on top of other logical links, as BFD can be used across various network virtualization technologies, as well). Control plane polling will use hellos to discover adjacent devices running the same control plane process, to exchange capabilities, determine the Maximum Transmission Unit (MTU), and, finally, to make certain the control plane process on the adjacent device is still running. These hellos are run across the control plane connection in Figure 14-5, which can be seen as a sort of “virtual link” passing through the physical link.
BFD polling will run underneath the control plane connection, as shown, verifying the operation of the physical connection and forwarding planes on the two connected devices. This two-layered approach allows BFD to operate much more quickly, even as a polling mechanism, than any routing protocol-based detection mechanism.
BFD 可以在四种不同的模式下运行:
BFD can operate in four distinct modes:
•异步模式:在此模式下,BFD 的作用类似于轻量级hello 协议。A 处的 BFD 进程可能运行在分布式进程上(甚至在专用集成电路 [ASIC] 中),将 hello 数据包发送到 C;C 处的 BFD 进程确认这些 hello 数据包。这是通过 hello 进行轮询的相当传统的用法。
• Asynchronous mode: In this mode, BFD acts like a lightweight hello protocol. The BFD process at A, potentially running on a distributed process (or even in an Application-Specific Integrated Circuit [ASIC]), sends hello packets to C; the BFD process at C acknowledges these hello packets. This is a fairly traditional use of polling through hellos.
•带回显的异步模式:在此模式下,A 中的 BFD 进程将向 C 发送 hello 数据包,因此 hello 数据包将仅通过转发路径进行处理,因此仅允许轮询转发路径。为了实现这一点,A 向 C 发送 hello 数据包,并以将它们转发回 A 的方式形成。例如,A 可以向 C 发送一个数据包,其中以 A 自己的地址为目的地;C 可以拾取该数据包并将其转发回 A。在这种模式下,A 发送的 hello 与 C 发送的 hello 完全不同;没有确认,只是两个系统发送独立的问候语,从每一端双向测试链路。
• Asynchronous mode with echo: In this mode, the BFD process in A will send hello packets to C so the hello packets will be processed only through the forwarding path, hence allowing only the forwarding path to be polled. To accomplish this, A sends hello packets to C formed in such a way that they will be forwarded back to A. For instance, A can send a packet to C with A’s own address as the destination; C can pick this packet up and forward it back to A. In this mode, the hellos transmitted by A are completely different from the hellos transmitted by C; there is no acknowledgment, just the two systems sending independent hellos that test the link bidirectionally from each end.
• Demand mode: In this mode, the two BFD peers agree to send hellos just when connectivity needs to be validated, rather than periodically. This is useful in the case where there is some other way to determine link status—for instance, if the [A,C] link is an Ethernet link, which means carrier detect is available to detect link failure—but when the alternate method is not necessarily trusted to provide accurate connectivity status in all situations. For instance, in the case of “switch in the middle,” where B is disconnected from A but not C, C could send a BFD hello on noting any problem with the connectivity to verify its connection with A is still good. In demand mode, some event, such as a lost packet, can cause a local process to trigger a BFD detection event.
• Demand mode with echo: This mode is like demand mode—regular hellos are not transmitted between the two devices running BFD. When a packet is transmitted, it is sent in such a way as to cause the other device to forward the hello packet back to the sender. This reduces the amount of processor load on both devices, allowing much faster timers to be used for BFD hellos.
Regardless of the mode of operation, BFD calculates different polling (hello) and detection (dead) timers separately across the link. The best way to explain the process is through an example. Assume A sends a BFD control packet with a proposed polling interval of 500ms, and C sends a BFD control packet with a proposed polling interval of 700ms. The higher number, or rather the slower polling interval, is chosen for the relationship; the rationale for this is the slower system must be able to keep up with the polling interval to prevent false positives.
The polling rate is modified in actual use to prevent synchronization of hello packets across multiple systems on the same wire. If there were four or five systems deploying Bidirectional Forwarding Detection (BFD) on a single multiaccess link, and every system sets its timer to send the next hello packet based on the receipt of the last packet, it is possible for all five systems to synchronize their hello transmission so all the hellos on the wire are transmitted at precisely the same moment. Since BFD normally operates with timers less than one second in length, this could result in a device receiving hellos from multiple devices at the same time, and not being able to process them quickly enough to prevent a false positive.
The specific modification used is to jitter the packets; each transmitter must take the base polling timer and subtract some random amount of time that is between 0% and 25% of the polling timer. For instance, if the polling timer is 700ms, as in the example given, A and C would transmit each hello packet sometime between around 525 and 700ms after the transmission of the last hello.
The final point to consider is the amount of time A and C will wait before declaring the link (or neighbor) down. In BFD, each device can calculate its own dead timer, normally expressed as a multiple of the polling timer. For instance, A could choose to consider the link (or C) down after two BFD hellos are missed, while C might decide to wait for three BFD hellos to be missed.
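The timer behavior just described can be sketched in a few lines of Python; the function names and the detection multiplier of 3 below are illustrative assumptions, not taken from any particular implementation:

```python
import random

def negotiate_interval(local_ms, remote_ms):
    # Both peers settle on the slower (larger) of the two proposed
    # polling intervals, so the slower system can keep up.
    return max(local_ms, remote_ms)

def jittered_interval(base_ms):
    # Subtract a random 0-25% of the base timer to keep hellos from
    # synchronizing across systems sharing the same wire.
    return base_ms - random.uniform(0, 0.25 * base_ms)

def dead_timer(base_ms, multiplier):
    # Each peer independently chooses its own detection multiplier.
    return base_ms * multiplier

interval = negotiate_interval(500, 700)  # A proposes 500ms, C proposes 700ms
print(interval)                          # 700: the slower interval wins
print(dead_timer(interval, 3))           # 2100: C waits three missed hellos
```

With a 700ms base timer, each jittered hello falls somewhere between 525 and 700ms after the previous one.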
Once a change in the network topology has been detected, it must be distributed in some way to all the devices participating in the control plane. Each item in a network topology can be described as either
• A link, or edge, including the nodes or reachable destinations attached to this link
• A device, or node, including the nodes, links, and reachable destinations connected to this device
This rather restricted set of terms lends itself to being held in a table, or database, often called the topology table or topology database. The question of distributing changes in the network topology to all the devices participating in the control plane, then, can be described as the process of distributing changes to specific rows in this table or database throughout the network.
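As a rough sketch (the keys and fields here are hypothetical, not any protocol's actual encoding), such a topology table might look like:

```python
# Hypothetical topology table: one row per link and one per node, keyed
# so a topology change can be distributed as a change to a single row.
topology = {
    ("link", "A", "B"): {"cost": 10, "destinations": ["2001:db8:3e8:100::/64"]},
    ("node", "A"): {"links": [("A", "B")], "destinations": ["2001:db8:3e8:100::/64"]},
}

def apply_update(table, key, row):
    # Distributing a change means replacing (or removing) one row,
    # rather than shipping the entire database.
    table[key] = dict(row)

apply_update(topology, ("link", "A", "B"),
             {"cost": 20, "destinations": ["2001:db8:3e8:100::/64"]})
print(topology[("link", "A", "B")]["cost"])  # 20
```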
The way in which information is distributed through a network depends on the design of the protocol, of course, but there are three commonly used kinds of distribution: hop-by-hop distribution, flooded distribution, and a centralized store of some sort.
In flooding, each device participating in the control plane receives, and stores, a copy of every piece of information about the network topology and reachable destinations. While there are a number of ways to synchronize a database, or table, only one is normally used in control planes: record-level replication. Figure 14-6 illustrates.
In Figure 14-6, each device will flood the information it knows to each neighbor, who will then reflood the information to each neighbor. For instance, A knows two specific things about the network topology: how to reach 2001:db8:3e8:100::/64 and how to reach B. A floods this information to B, which, in turn, floods this information to C. Each device in the network ultimately ends up with a copy of all the topology information available; A, B, and C have synchronized topology databases (or tables).
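A minimal sketch of this record-level flooding, assuming each record carries a sequence number so a device can recognize (and stop reflooding) copies it already holds:

```python
from collections import defaultdict

# Each device floods records to its neighbors; a record is accepted and
# reflooded only if its sequence number is newer than the stored copy,
# which is what keeps every database synchronized (and stops the flood).
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
databases = defaultdict(dict)  # device -> {record_id: (sequence, data)}

def flood(device, record_id, seq, data):
    stored = databases[device].get(record_id)
    if stored and stored[0] >= seq:
        return  # already hold this copy (or newer): do not reflood
    databases[device][record_id] = (seq, data)
    for neighbor in neighbors[device]:
        flood(neighbor, record_id, seq, data)

flood("A", "reach:2001:db8:3e8:100::/64", seq=1, data={"via": "A"})
print(databases["A"] == databases["B"] == databases["C"])  # True
```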
In Figure 14-6, C’s connectivity to D is shown as an item in the database; not all control planes would include this information. Instead, C may just include connectivity to the 2001:db8:3e8:102::/64 range of addresses (or subnet), which contains D’s address.
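Whether an address falls inside an advertised range is a simple containment check; D's concrete address below is invented for illustration:

```python
import ipaddress

subnet = ipaddress.ip_network("2001:db8:3e8:102::/64")
d_address = ipaddress.ip_address("2001:db8:3e8:102::d")  # hypothetical address for D

# C can advertise only the subnet; D's address is covered by it.
print(d_address in subnet)  # True
```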
Note
In larger networks, it is impossible for the entire description of a device’s connections to fit into a single MTU-sized packet, and connection information needs to be timed out and reflooded on a regular basis to ensure freshness.
An interesting problem arises in flooded distribution mechanisms that can cause temporary routing loops, called microloops; Figure 14-7 illustrates.
In Figure 14-7, assume the [E,D] link fails. Consider the following chain of events, including some roughly possible times for each event:
1. Start: A is using E to reach D; C is using D to reach E.
2. 100ms: E and D discover the link failure.
3. 500ms: E and D flood information about the topology change to C and A.
4. 750ms: C and A receive the updated topology information.
5. 1,000ms: E and D recompute their best paths; E selects A as its best path to reach D, D selects C as its best path to reach E.
6. 1,250ms: A and C flood information about the topology change to B.
7. 1,400ms: A and C recompute their best paths; A selects B to reach D, C selects B to reach E.
8. 1,500ms: B receives the updated topology information.
9. 2,000ms: B recomputes its best paths; it chooses C to reach D, and A to reach E.
While the times and ordering might vary slightly in any particular network, the ordering of discovery, advertisement, and recomputing will almost always follow a similar pattern. In this example, a microloop forms between steps 5 and 7; for 400ms, A is using E to reach D, and E is using A to reach D. Any traffic entering the ring at either A or D during the time between E’s recalculation of the best path to D and A’s recalculation of the best path to D will loop. A more formal definition of this problem will be considered in the later section, “Consistency, Accessibility, and Partitionability.” One solution to this problem is to precompute Loop-Free Alternates or remote Loop-Free Alternates (both discussed in Chapter 13).
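The microloop window in this timeline can be replayed with a small sketch (the times and next hops are taken from the numbered steps above):

```python
# Each entry is (time_ms, device, next_hop_toward_D); a microloop exists
# whenever A forwards toward E while E forwards toward A.
fib_events = [
    (0, "A", "E"), (0, "E", "D"),  # start: E still reaches D directly
    (1000, "E", "A"),              # step 5: E recomputes after the failure
    (1400, "A", "B"),              # step 7: A recomputes 400ms later
]

def next_hop(device, t):
    hop = None
    for time_ms, dev, nh in fib_events:
        if dev == device and time_ms <= t:
            hop = nh
    return hop

def microloop(t):
    return next_hop("A", t) == "E" and next_hop("E", t) == "A"

print(microloop(1200))  # True: inside the 400ms window
print(microloop(1500))  # False: A has switched to B
```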
In hop-by-hop distribution, each device computes a local best path and sends just the best path to its neighbors. Figure 14-8 illustrates.
In Figure 14-8, each device advertises information about what it can reach to each of its neighbors. D, for instance, advertises reachability to E, and B advertises reachability to C, D, and E toward A. It is interesting to consider what happens when A advertises its reachability toward E through the link along the top of the network. Once E receives this information, it will have two paths to B, for instance: one through D and one through A. In the same way, A will have two paths to B: one directly to B and another through E. Any of the shortest path algorithms discussed in previous chapters can determine which of these paths to use, but is it possible for microloops to form with a hop-by-hop distribution mechanism? Consider:
1. E chooses the path through A to reach B.
2. The [A,B] link fails.
3. A detects this failure, and switches to the path through E.
4. A then advertises this new path to E.
5. E receives the changed topology information and calculates a new best path through D.
During the time between steps 3 and 5, A will point to E as its best path to B, while E will point to A as its best path to B—a microloop. Most hop-by-hop distribution systems resolve this through split horizon or poison reverse. Defined, these are as follows:
• The split horizon rule states: a device should not advertise reachability to a destination toward the neighbor it is using to reach that destination.
• The poison reverse rule states: a device should advertise a destination toward the adjacent device it is using to reach that destination with an infinite metric.
If split horizon is implemented in Figure 14-8, E would not advertise reachability to B, as it is using the path through A to reach B. Alternatively, E could poison the route to B through A, which would have the effect of ensuring A has no path through E to B.
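Both rules can be sketched as a filter applied while building the advertisement for a given neighbor; the metric of 16 as "infinity" borrows the RIP convention and is just an assumption here:

```python
INFINITY = 16  # "infinite" metric, borrowing the RIP convention

def build_advertisement(routes, neighbor, poison_reverse=False):
    # routes: destination -> (next_hop, metric) in the local table
    advertisement = {}
    for dest, (next_hop, metric) in routes.items():
        if next_hop == neighbor:
            if poison_reverse:
                advertisement[dest] = INFINITY  # poison the reverse path
            # plain split horizon: stay silent about this destination
        else:
            advertisement[dest] = metric
    return advertisement

e_routes = {"B": ("A", 2), "D": ("D", 1)}  # E reaches B through A
print(build_advertisement(e_routes, "A"))                       # {'D': 1}
print(build_advertisement(e_routes, "A", poison_reverse=True))  # {'B': 16, 'D': 1}
```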
In a centralized system, each network device reports information about changes to the topology and reachability to a controller, or rather some collection of off-box services and devices acting as a controller. While centralization often evokes the idea of a single device (or virtual device) to which all information is reported, and which feeds the correct forwarding information to all the packet processing devices in the network, this is an oversimplification of what a centralized control plane really means. Figure 14-9 illustrates.
In Figure 14-9, when the link between D and F fails:
1. D and F both report the topology change to the controller, Y.
2. Y forwards this information to the other controller, X.
3. Y computes the best path to each destination without the [D,F] link and sends it to each affected device in the network.
4. Each device installs this new forwarding information into its local table.
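A toy sketch of steps 1 through 4, with the controller recomputing first hops by breadth-first search (the class and method names are invented, and BFS stands in for whatever path computation a real controller runs):

```python
from collections import deque

class Controller:
    """Hypothetical controller: devices report failures; the controller
    recomputes next hops over its copy of the topology and pushes them."""
    def __init__(self, links):
        self.adj = {}
        for a, b in links:
            self.adj.setdefault(a, set()).add(b)
            self.adj.setdefault(b, set()).add(a)
        self.device_tables = {}

    def report_failure(self, a, b):
        self.adj[a].discard(b)
        self.adj[b].discard(a)
        self.push(a)
        self.push(b)

    def push(self, device):
        # Recompute first hops from this device by breadth-first search,
        # then "push" the table to the device's local forwarding table.
        table, seen = {}, {device}
        q = deque((n, n) for n in self.adj[device])
        seen |= set(self.adj[device])
        while q:
            node, first = q.popleft()
            table[node] = first
            for n in self.adj[node] - seen:
                seen.add(n)
                q.append((n, first))
        self.device_tables[device] = table

y = Controller([("D", "F"), ("D", "C"), ("C", "E"), ("F", "E")])
y.report_failure("D", "F")
print(y.device_tables["D"]["E"])  # C: D now reaches E through C
```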
A specific instance of step 3 is Y computing a next best path to E without the [D,F] link, and sending it to D to install in its local forwarding table. Can microloops form in a centralized control plane?
• The databases in X and Y need to be synchronized for both controllers to compute the same loop-free paths through the network.
• Synchronizing these databases will involve the same challenges, and (probably) use the same solutions, as the solutions discussed thus far in this chapter.
• There will be some time required for the connected devices to discover the change in topology and report the change to the controller.
• There will be some time required for the controller to compute new loop-free paths.
• There will be some time required for the controller to notify the affected devices of the new loop-free paths through the network.
During the timing intervals described here, it is still possible for the network to form microloops. A centralized control plane most often simply means the control plane is not running on the devices forwarding traffic. Although they may seem radically different, centralized control planes actually use many of the same mechanisms to distribute topology and reachability, and the same algorithms to compute loop-free paths through the network, as distributed control planes.
In all three distribution systems discussed in this chapter—flooding, hop by hop, and centralized stores—the problem of microloops arises. Protocols implementing these techniques have various systems, such as split horizon and Loop-Free Alternates, to work around these microloops, or they allow the microloop to occur, assuming the impact on the network will not be too great. Is there a unifying theory or model that will allow engineers to understand the problems inherent in the distribution of data through a network and the various tradeoffs involved?
There is: the CAP theorem.
In 2000, Eric Brewer, working on both theoretical and practical pursuits, postulated there are three qualities to a distributed database: Consistency, Accessibility, and Partition tolerance (CAP). Between these three, there is always a tradeoff such that you can choose two of the three in any system design. This conjecture, later proved true mathematically, is now known as the CAP theorem. The three terms are defined as
• Consistency: Every reader sees a consistent view of the contents of the database. If some device C writes to the database moments before two other devices, A and B, read from the database, the two readers will receive the same information. In other words, there is no lag between the writing of the database and both of the readers, A and B, being able to read the information that was just written.
• Accessibility: Every reader has access to the database when required (in near real time). The response to a read may be delayed, but every read will receive a response. Another way to put this is every reader has access to the database all the time; there is no time during which a reader would receive the answer “you cannot query this database right now.”
• Partition tolerance: The ability of the database to be copied, or partitioned onto multiple devices.
It is simpler to see the CAP theorem in a small network; Figure 14-11 is used for this.
Assume A contains a single copy of a database that both C and D must access. Assume C writes some information to the database and then immediately after C and D both read the same information. The only processing that must take place to make certain C and D receive the same information is on A itself. Now, replicate the database, so there is a copy on E and another copy on F. Now assume K writes to the replica on E, and L reads from the replica on F. What will happen?
• F could return the value it currently has, even though it is not the same value K just wrote. This means the database returns an inconsistent reply, so consistency has been sacrificed by partitioning the database.
• If the two databases are synchronized, the reply will eventually be the same, of course, but it will take some time to package the change up (marshal the data), transfer it to F, and integrate the change into F’s local copy. F could lock the database, or a specific part of the database, while the synchronization is taking place. In this case, when L reads the data, it may receive a reply that the record is locked. In this case, accessibility is lost, but consistency and the partitioning of the database are preserved.
• If the two databases are merged, then consistency and accessibility can be preserved, at the cost of partitioning.
There is no way to work it out so all three are preserved because of the time required to synchronize the information between the two copies of the database. The same problem holds true for a sharded database.
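The tradeoff can be demonstrated with two in-memory replicas; the `Replica` class is a hypothetical stand-in for a partitioned database:

```python
import copy

class Replica:
    """Hypothetical partitioned database copy."""
    def __init__(self):
        self.data = {}
        self.locked = False

    def read(self, key):
        if self.locked:
            raise RuntimeError("record locked during synchronization")
        return self.data.get(key)

e_copy, f_copy = Replica(), Replica()

# K writes to E's copy; F has not yet been synchronized.
e_copy.data["route"] = "via-A"
print(f_copy.read("route"))  # None: an inconsistent answer (consistency lost)

# Alternatively, lock F while the change is marshaled and transferred:
f_copy.locked = True
try:
    f_copy.read("route")
except RuntimeError:
    print("read refused")     # accessibility sacrificed instead
f_copy.locked = False
f_copy.data = copy.deepcopy(e_copy.data)  # synchronization completes
print(f_copy.read("route"))  # via-A: consistent again, after a delay
```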
How does this apply to control planes? In a distributed control plane, the database from which the control plane draws information to calculate loop-free paths is partitioned across the entire network. Further, the database is locally readable at any time in order to calculate loop-free paths. Given the partitioning and accessibility required of the distributed database used in a control plane, you should expect consistency to suffer—and it does, resulting in microloops during convergence. A centralized control plane does not “solve” this problem; rather it just moves the problem around, or allows the designer to make different choices in the tradeoffs. A centralized control plane running on a single device will always be consistent, but it will not always be accessible, and the lack of partitioning will present an issue in the resilience of the network.
The three poles—consistency, accessibility, and partition tolerance—are not as clear-cut as they have been presented here, of course. There are often situations where less partitioning can result in more consistency, or short-term losses in availability will yield large increases in consistency. In other words, the CAP theorem does not really describe a set of three absolute poles, but rather a set of extreme points across a range of possibilities. In this way it is much like the state, optimization, surface triad found in an analysis of network complexity.2
The CAP theorem is a useful way to think about the performance of the database used in control planes.
The problem of detecting and distributing information about topology changes is second only to the problem of calculating shortest paths over a network in the space of network engineering. Breaking the problem down into four steps—detection, reporting, calculation, and installation—provides a framework you can use to assess the various options and think through the way a network really converges. Two broad classes of solutions are available, event driven and polling, each with a different set of tradeoffs; control planes normally use some form of record-level replication to carry topology information through the network in the case of a change.
The problems of loops and microloops have been particularly thorny in link state protocols, mirrored by dropped packets in distance vector protocols. These problems have occasioned years of research on the part of the best minds in protocol design; ultimately, however, all these solutions run up against the three-way tradeoff of the CAP theorem. The CAP theorem will show up again when considering centralized control planes.
The next two chapters will consider the three basic kinds of widely deployed control planes—distance vector, link state, and path vector. The material in this chapter and the two chapters on unicast loop-free paths should enable you to more readily understand the operation of the examples given in the following chapters. Overall, understanding what problems a control plane needs to solve, and the solutions available, will help you ask the right questions of any control plane and quickly understand its operation.
Bhatia, Manav, Carlos Pignataro, Sam Aldrin, and Trilok Ranganath. OSPF Extensions to Advertise Seamless Bidirectional Forwarding Detection (S-BFD) Target Discriminators. Request for Comments 7884. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7884.txt.
Bryant, Stewart, Stefano Previdi, Clarence Filsfils, Pierre Francois, Mike Shand, and Olivier Bonaventure. Framework for Loop-Free Convergence Using the Ordered Forwarding Information Base (oFIB) Approach. Request for Comments 6976. RFC Editor, 2013. https://rfc-editor.org/rfc/rfc6976.txt.
Gilbert, Seth, and Nancy A. Lynch. “Perspectives on the CAP Theorem.” Computer 45 (2011): 30–36. doi:10.1109/MC.2011.389.
Huang, Peng, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems.” In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, 150–155. HotOS ’17. New York, NY, USA: ACM, 2017. doi:10.1145/3102980.3103005.
Katz, Dave, and David Ward. Bidirectional Forwarding Detection (BFD). Request for Comments 5880. RFC Editor, 2010. https://rfc-editor.org/rfc/rfc5880.txt.
Pignataro, Carlos, David Ward, Manav Bhatia, Nobo Akiya, and Juniper Networks.
Seamless Bidirectional Forwarding Detection (S-BFD). Request for Comments 7880. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7880.txt.
White, Russ, and Jeff Tantsura. Navigating Network Complexity: Next-Generation Routing with SDN, Service Virtualization, and Service Chaining. Indianapolis, IN: Addison-Wesley Professional, 2015.
1. Consider the concept of information overload in reporting topology changes within the context of the State/Optimization/Surface (SOS) model. What are some of the tradeoffs in sending topology change information more quickly versus more slowly, in terms of optimization versus state?
2. Consider polling and event-driven notification within the context of the state, optimization, surface model. List at least one or two, more if possible, positive and negative aspects of each kind of solution in each of the three realms of the model (SOS).
3. Read the paper “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems.” Do you think a polling-based or event-driven solution would be best for solving the kinds of problems described in the paper? Why?
4. Explain why jitter is introduced in BFD sessions.
5. One alternative to record-level replication is binary-level replication of two files. For instance, rsync uses binary replication to synchronize two files or databases. Why would network control planes not use binary replication?
6. What is the relationship between network topology and microloops? Will microloops form in any topology or only rings? Does the size of the ring impact whether or not microloops will form?
1. Bryant et al., Framework for Loop-Free Convergence Using the Ordered Forwarding Information Base (oFIB) Approach.
2. For more information on complexity theory and control planes, see White and Tantsura, Navigating Network Complexity.
The previous several chapters have considered three broad areas of problems every control plane for a packet switched network must solve and a range of solutions for each of those problems. The first problem considered was discovering the network topology and reachability. The second was calculating loop-free (and, in some cases, disjoint) paths through the network. The final problem, reacting to topology changes, is really a set of problems, including detecting and reporting changes to the network across the control plane.
This chapter will consolidate these problems and solutions by examining a few implementations of distributed control planes used for unicast forwarding in packet switched networks. The implementations here are not chosen because they are widely used, but rather because they represent a range of implementation choices among the solutions outlined in the previous chapters. The basic operation of each protocol is considered in each case; later chapters in this part of the book will delve into information hiding and other more advanced topics in control planes, so they will not be covered here.
Rather than diving directly into protocol operation, the first section of this chapter will begin by defining various broad classes of control planes. Once these broad definitions are out of the way, six distributed unicast control planes will be considered.
Control planes are typically classified by two characteristics. First, they are divided based on where the loop-free paths are calculated, whether on the forwarding device or off. Control planes in which the actual switching devices directly participate in the calculation of loop-free paths are then divided up based on the kind of information they carry about the network. There is no classification based on the algorithm used to calculate loop-free paths, although this is often intimately tied to the kind of information carried by the control plane.
While centralized control planes are often related to a few (or one, conceptually) controllers gathering the reachability and topology information from each switching device, calculating the set of loop-free paths, and downloading the resulting forwarding table to the switching devices, the concept is much less strict. A centralized control plane more generally just means calculating some part of the forwarding information someplace other than the actual forwarding device. This may mean a single device or a set of devices; it may mean a set of processes running in a virtual machine; it may mean calculating all of the required forwarding information or (perhaps) most of it.
Distributed control planes are generally marked by three general characteristics:
• A protocol running on each device that implements the various mechanisms required to transport reachability and topology information between devices
• A set of algorithms implemented on each device, used to compute a set of loop-free paths to known destinations
• The ability to detect and react to changes in reachability and topology locally at each device
In distributed control planes, not only is each packet switched hop by hop, but each hop determines the set of loop-free paths to reach any particular destination locally. Distributed control planes are generally divided into three broad classes of protocols: link state, distance vector, and path vector.
In link state protocols, each device advertises the state of each connected link, including reachable destinations and neighbors attached to the link. This information forms a topology database containing every link, every node, and every reachable destination in the network, across which an algorithm such as Dijkstra’s or Suurballe’s can be used to calculate a set of loop-free or disjoint paths. Link state protocols typically flood their databases so each forwarding device has a copy that is synchronized with every other forwarding device.
In distance vector protocols, each device advertises a set of distances to known reachable destinations. This reachability information is advertised by a particular neighbor that provides the vector information, or rather the direction through which the destination can be reached. Distance vector protocols typically implement either Bellman-Ford, Garcia-Luna’s DUAL, or some similar algorithm to calculate loop-free paths through the network.
In path vector protocols, the path to reach the destination is recorded as the routing advertisement passes through the network, on a node-by-node basis. Other information may be added, such as metrics, to express some form of policy, but the primary loop-free nature of each path is calculated based on the actual paths advertisements take when passing through the network.
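The recorded-path loop check that path vector protocols rely on can be sketched in a few lines of Python; the function and node names below are illustrative, not drawn from any particular protocol implementation:

```python
def accept_path_vector(local_node, advertised_path):
    """Reject any advertisement whose recorded path already contains
    the local node; accepting it would form a forwarding loop."""
    if local_node in advertised_path:
        return None
    # Prepend the local node, as a router does when re-advertising.
    return [local_node] + advertised_path

# D originates the route; C and B prepend themselves in turn.
path_at_c = accept_path_vector("C", ["D"])      # ['C', 'D']
path_at_b = accept_path_vector("B", path_at_c)  # ['B', 'C', 'D']
# If the advertisement somehow returned to C, it would be rejected.
looped = accept_path_vector("C", path_at_b)     # None
```

The primary loop-free property comes entirely from this membership test; metrics and policy are layered on top of it.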
Figure 15-1 illustrates these three kinds of distributed control planes.
In Figure 15-1:
• In the link state example, at the top, each device advertises what it can reach to every other device in the network. Hence, A advertises reachability to B, C, and D; at the same time, D advertises reachability to 2001:db8:3e8:100::/64 and to C, B, and A.
• In the distance vector example, in the middle, D advertises reachability to 2001:db8:3e8:100::/64 to C with its local cost, which is 1. C adds the [D,C] cost and advertises reachability to 2001:db8:3e8:100::/64 with a cost of 2 to B.
• In the path vector example, at the bottom, D advertises reachability to 2001:db8:3e8:100::/64 through itself. C receives this advertisement and adds itself to [D,C].
Control planes do not always neatly fit into one category or another, particularly when you move into various forms of information hiding. Some link state protocols, for instance, use distance vector principles with aggregated information, and path vector protocols often use some form of distance vector metric arrangement to augment the path in calculating loop-free paths. These classifications (centralized, distance vector, link state, and path vector) are important to understand because you will encounter them throughout the network engineering world.
The Spanning Tree Protocol (STP) was originally designed by Radia Perlman, and first described in 1985 in An Algorithm for Distributed Computation of a Spanning Tree in an Extended LAN.1 STP is unique in the list of control planes considered here because it was originally designed to support switching rather than routing. In other words, STP was designed to support forwarding on packets without a Time to Live (TTL), and without a per hop header swap by the switching device. Packets switched based on the STP are carried through the network without change.
The process of building a loop-free tree is as follows:
1. Each device places all ports in blocked mode so that no port will forward any traffic, and begins advertising Bridge Protocol Data Units (BPDUs) out each port. This BPDU contains
a. The ID of the advertising device, which is a priority combined with a local interface Media Access Control (MAC) address.
b. The ID of the candidate root bridge. This is the bridge with the lowest ID the local device knows about. If every device on the network starts at the same moment, then each device would advertise itself as the candidate root bridge until it learned of other bridges with a lower bridge ID.
2. On receiving a BPDU on an interface, the root bridge ID contained in the BPDU is compared with the locally stored lowest root bridge ID. If the root bridge ID contained in the BPDU is lower, then the locally stored root bridge ID is replaced with the newly discovered bridge with a lower ID.
3. After a few rounds of advertisements, every bridge should have discovered the bridge with the lowest bridge ID in the network and declared this bridge to be the root bridge.
a. This should occur while all the ports on all the devices are still in a blocked state (not forwarding traffic).
b. To make certain this does happen while all the ports are still blocked, a timer is set long enough to allow the root bridge to be elected.
4. Once the root bridge is elected, the shortest path to the root bridge is determined.
a. Each BPDU also contains a metric to reach the root bridge. This metric may be a hop count, but the cost of each hop can vary based on administrative variables as well, such as the bandwidth of the link.
b. Each device determines the port through which it has the lowest cost path to the root bridge; this is marked as the root port.
c. If there is more than one path to the root bridge with the same cost, a tie breaker is used; this is normally the port identifier.
5. For any link on which two bridges are connected
a. The bridge with the lowest cost path to the root bridge is elected to forward traffic off the link toward the root bridge.
b. The port connecting the elected forwarder to the link is marked as the designated port.
6. Ports marked as either root or designated ports are allowed to forward traffic.
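The election in steps 1 through 3 reduces to comparing bridge IDs, where the lowest priority-plus-MAC value always wins. A minimal sketch, with IDs loosely modeled on the example that follows; the function names are illustrative:

```python
def bridge_id(priority, mac):
    """A bridge ID is the configured priority combined with a local
    MAC address; tuples compare priority first, then MAC."""
    return (priority, mac)

def elect_root(candidates):
    """Each bridge keeps the lowest bridge ID it has heard of as the
    candidate root; after enough BPDU exchanges, every bridge in the
    network converges on the same answer."""
    return min(candidates)

bridges = [
    bridge_id(32768, "0200.0000.6666"),  # F
    bridge_id(28672, "0200.0000.4444"),  # D
    bridge_id(32768, "0200.0000.5555"),  # E
    bridge_id(24576, "0200.0000.3333"),  # C
]
root = elect_root(bridges)  # C has the lowest priority, so C wins
```

Because the comparison is deterministic, it does not matter in which order the BPDUs arrive; all bridges eventually agree on the same root.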
The result of this process is a single tree over which every destination in the network is reachable. Figure 15-3 is used to show how STP works in an actual topology.
Assume all the devices in Figure 15-3 were turned on at the same moment. There are a number of variations possible in timing, but the process of building a set of loop-free paths through the network would look, from F’s perspective, something like this:
1. Elect the root bridge:
a. F advertises a BPDU to E and D with an ID and a candidate root bridge of 32768.0200.0000.6666.
b. D (assuming D has not received any BPDUs) advertises a BPDU with an ID and a candidate root bridge of 28672.0200.0000.4444.
c. E (assuming E has not received any BPDUs) advertises a BPDU with an ID and a candidate root bridge of 32768.0200.0000.5555.
d. At this point, F will elect D as the root bridge, and start advertising BPDUs with its local ID and the candidate root bridge set to D’s ID.
e. At some point, D and E will both receive BPDUs from C, which has a lower bridge ID (24576.0200.0000.3333). On receiving this BPDU, they will both set their candidate root bridge ID to C’s ID and send new BPDUs to F.
f. On receiving these new BPDUs, F will note the new candidate root bridge ID is lower than its previous candidate root bridge ID, and it will then elect C as the root bridge.
g. After several rounds of BPDUs, all the bridges in the network will elect C as the root bridge.
2. Mark the root ports by finding the shortest path to the root:
a. Assume each link is a cost of 1.
b. D will receive a BPDU from C with a local ID and root bridge ID of 24576.0200.0000.3333 and a cost of 0.
c. D will add the cost of reaching C, a single hop, advertising it can reach the root bridge with a cost of 1 to F.
d. E will receive a BPDU from C with a local ID and root bridge ID of 24576.0200.0000.3333 and a cost of 0.
e. E will add the cost of reaching C, a single hop, advertising it can reach the root bridge with a cost of 1 to F.
f. F now has two advertisements toward the root bridge with equal cost; it must break the tie between these two available paths. To do so, F examines the bridge ID of the advertising bridges. D’s bridge ID is lower than E’s, so F will mark its port toward D as its root port.
3. Mark the designated ports on each link:
a. F’s only other port is toward E. Should this port be blocked?
b. To determine this, F compares its local bridge ID with E’s bridge ID. The priorities are the same, so the local port addresses must be compared to make the decision. F’s local ID ends in 6666, while E’s ends in 5555, so E’s is lower.
c. F does not mark the interface toward E as a designated port; instead, it marks this port as blocked.
d. E does the same comparison and marks its port toward F as a designated port.
e. D compares its cost toward the root with F’s cost toward the root.
f. D’s cost is lower, so it will mark its port toward F as a designated port.
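F’s tie-break in step 2f of the root port selection can be sketched as a comparison of (cost, bridge ID, port) tuples; the port names here are hypothetical:

```python
def select_root_port(candidates):
    """candidates: list of (cost_to_root, advertising_bridge_id, port).
    The lowest cost wins; ties are broken by the lower advertising
    bridge ID, as in F's choice between D and E."""
    return min(candidates)[2]

# F hears equal-cost paths toward the root through D and E.
rp = select_root_port([
    (1, (28672, "0200.0000.4444"), "port-to-D"),  # via D
    (1, (32768, "0200.0000.5555"), "port-to-E"),  # via E
])
# D's bridge ID is lower, so the port toward D becomes the root port.
```

Python compares tuples element by element, so cost is considered first and the bridge ID only breaks ties, mirroring the protocol’s ordering of criteria.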
Figure 15-4 illustrates the blocked, designated, and root ports once these calculations are completed.
The ports in Figure 15-4 are marked with bp for blocked port, rp for root port, and dp for designated port. The result of the process is a tree that can reach any segment in the network, and hence the hosts connected to any segment in the network. One interesting point about STP is the result is a single tree across the entire topology, anchored at the root bridge. If some host connected to E sends a packet to a host connected to B or F, the packet must travel through C, the root bridge, because one of the two ports on the [F,E] and [E,B] links is blocked. This is not the most efficient use of bandwidth, but it does prevent looping packets during normal forwarding.
How is neighbor discovery handled in STP? Neighbor discovery is not addressed from the perspective of the reliable transport of information through the network at all. Each device in the network builds its own BPDUs; these BPDUs are not carried through any device, so there is no need for end-to-end reliable transport in the control plane. Neighbor discovery is used, however, to elect a root bridge and to build a loop-free tree across the entire topology using BPDUs.

What about dropped and missed packets? Any device running STP retransmits its BPDUs on every link periodically (according to a retransmission timer); it takes a few dropped packets (according to a dead timer) for a device running STP to assume its neighbors have failed, and hence to restart calculating the root bridge and port statuses.

There is no two-way connectivity check in STP, either on a per neighbor basis or across the entire path. Nor is there any Maximum Transmission Unit (MTU) check of any kind. STP learns about the topology by combining BPDUs with local link information on a per node basis; there is no single node in the network with a table describing the entire topology, however.
How does STP enable forwarding? More specifically, how do devices running STP learn about reachable destinations? Figure 15-5 is used to explain.
Figure 15-5 shows the state of the network with the spanning tree calculated and each port marked as a designated or root port. There are no blocked ports in this topology because there are no loops. Assume B, C, and D have no information about attached devices; A sends a packet toward E. What happens at this point?
1. A transmits the packet onto the [A,B] link. As B has a designated port on this link, it will accept the packet (switches accept all packets on designated ports) and examine the source and destination addresses.
2. B can determine that A is reachable through this designated port because it has received a packet from A on this port. Based on this, B will insert A’s MAC address as reachable in its forwarding table through its interface onto the [A,B] link.
3. B does not have any information about E; therefore it will flood this packet out every one of its nonblocked ports. In this case, the only other port B has is its root port, so B will forward this packet toward C. This flooding is called Broadcast, Unknown, and Multicast (BUM) traffic; BUM traffic is something every control plane that learns destinations during the forwarding process must manage in some way.
4. When C receives this packet, it will examine the source address and discover that A is reachable through the designated port attached to [B,C]. It will insert this information into its local forwarding table.
5. C also has no information about where E is located on the network, so it will simply flood the packet on all nonblocked ports. In this case, the only other port C has is onto the [C,D] link.
6. D repeats the same process B and C have followed, learning that A is reachable through its root port onto the [C,D] link and flooding the packet onto the [D,E] link.
7. When E receives the packet, it processes the information and sends a reply back toward A.
8. When D receives this reply packet from E, it will examine the source address and discover E is reachable on its designated port onto the [D,E] link. D does know the path back to A, as it discovered this information in processing the first packet in the flow traveling from A to E. It will look up A in its forwarding table and transmit the packet onto the [C,D] link.
9. C and B will repeat the process D and C have used to discover the location of E and to forward the return traffic back to A.
In this way—learning the source address from incoming packets, and either flooding or forwarding packets onto outgoing links—every device in the network can learn about every reachable destination. Because STP relies on learning reachable destinations in reaction to packets being transmitted on the network, it is classified as a reactive control plane. Note this learning process is at the host level; subnets and Internet Protocol (IP) addresses are not learned, but rather the physical address of the host interface. If a single host has two physical interfaces onto the same wire, it will appear as two different hosts to the STP control plane.
How is information removed from the forwarding tables on each device? Through a timeout process. If a forwarding entry has not been used in a specific time (a hold timer), the entry is removed from the table. Hence, STP relies on cached forwarding information.
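The reactive learning, BUM flooding, and cache timeout behavior described above can be sketched as a small learning-bridge class; the class and port names are invented for illustration:

```python
import time

class LearningBridge:
    """Minimal sketch of STP-style reactive learning: source MACs are
    cached per port, unknown destinations are flooded (BUM traffic),
    and stale entries age out after hold_time seconds."""

    def __init__(self, ports, hold_time=300.0):
        self.ports = set(ports)
        self.hold_time = hold_time
        self.table = {}  # mac -> (port, last_seen)

    def receive(self, src_mac, dst_mac, in_port, now=None):
        now = time.monotonic() if now is None else now
        # Learn: the source is reachable through the ingress port.
        self.table[src_mac] = (in_port, now)
        entry = self.table.get(dst_mac)
        if entry and now - entry[1] < self.hold_time:
            return {entry[0]}  # known destination: forward out one port
        # Unknown destination: flood out all other non-blocked ports.
        return self.ports - {in_port}

b = LearningBridge(ports=["p1", "p2", "p3"])
flooded = b.receive("A", "E", "p1", now=0.0)  # E unknown: flood p2, p3
reply = b.receive("E", "A", "p3", now=1.0)    # A was learned: forward p1 only
```

Note that the table holds host MAC addresses, not prefixes, matching the host-level learning the text describes; the hold timer implements the cache timeout.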
STP is clearly not a link state protocol, nor is it a path vector protocol. Is it a distance vector protocol? Any confusion over how to classify the protocol stems from the initial selection of a root bridge before the shortest paths are calculated. Removing this first step, it is easier to classify STP as a distance vector protocol using a distributed form of the Bellman-Ford algorithm to calculate loop-free paths across the topology. What should be done with the initial root bridge calculation? This part of the process ensures there is just one Shortest Path Tree across the entire network. So STP can be classified as a distance vector protocol that uses the Bellman-Ford algorithm to compute a single set of shortest paths for all destinations across the entire network. Another way to put this is STP computes a Shortest Path Tree across the topology, rather than across the destinations.
Why is it important that a single tree be calculated across the entire network? This is related to the way in which STP learns reachability information: STP is a reactive control plane, learning reachability in response to actual packets flowing through the network. If each device built a separate tree rooted at itself, this reactive process would lead to an inconsistent view of the network topology and hence to forwarding loops.
The Routing Information Protocol (RIP) was originally specified in RFC1058, Routing Information Protocol, published in 1988.2 The protocol was updated in a series of more recent RFCs, including RFC2453, RIP Version 2,3 and RFC2080, RIPng for IPv6.4 Figure 15-7 is used to explain RIP operation.
The operation of RIP is deceptively simple. In Figure 15-7:
1. A discovers 2001:db8:3e8:100::/64 because it is configured on a directly attached interface.
2. A adds this destination to its local routing table with a cost of 1.
3. As 100::/64 is installed in the local routing table, A will advertise this reachable destination (route) to B and C.
4. When B receives this route, it will add the cost of the inbound interface so that the path through A has a cost of 2, and examine its local table for any lower-cost routes to this destination. As B has no other path to 100::/64, it will install the route in its routing table and advertise the route to E.
5. When C receives this route, it will add the cost of the inbound interface so that the path through A has a cost of 2, and examine its local table for any lower-cost routes to this destination. As C has no other path to 100::/64, it will install the route in its routing table and advertise the route to D and E.
6. When D receives this route, it will add the cost of the inbound interface from C so that the path through C has a cost of 3, and examine its local table for any lower-cost routes to this destination. As D has no other path to 100::/64, it will install the route into its routing table and advertise the route to E.
7. E will now receive three copies of the same route; one through C with a cost of 3, one through B with a cost of 4, and one through D with a cost of 5. E will choose the path through C with a cost of 3, installing this path to 100::/64 into its local routing table.
8. E will not advertise any path to 100::/64 toward C, because it is using C as its best path to reach this specific destination. Thus, E will split horizon its advertisement of 100::/64 toward C.
9. While E will advertise its best path, through C, to both D and B, neither will choose the path through E, as they already have better paths available toward 100::/64.
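The per-update bookkeeping in the steps above, plus the split horizon rule, can be sketched as follows; the function names are hypothetical, and the metrics follow the example:

```python
INFINITY = 16  # RIP treats a metric of 16 as unreachable

def process_update(table, neighbor, routes, link_cost=1):
    """Distance vector processing: add the inbound link cost, then
    install the route only if it beats the current best path."""
    for prefix, metric in routes.items():
        cost = min(metric + link_cost, INFINITY)
        best = table.get(prefix)
        if best is None or cost < best[0]:
            table[prefix] = (cost, neighbor)

def advertise(table, to_neighbor):
    """Split horizon: never advertise a route back to the neighbor
    it was learned from."""
    return {p: c for p, (c, nh) in table.items() if nh != to_neighbor}

# E hears the same prefix from C, B, and D, as in the example.
e_table = {}
process_update(e_table, "C", {"2001:db8:3e8:100::/64": 2})
process_update(e_table, "B", {"2001:db8:3e8:100::/64": 3})
process_update(e_table, "D", {"2001:db8:3e8:100::/64": 4})
# E keeps the path through C, and split-horizons it back toward C.
```

Running this, E installs the route through C with a total cost of 3, advertises it toward B and D, and advertises nothing about this prefix back toward C.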
RIP advertises a set of destinations, along with the cost to reach each destination, one hop at a time through the network; hence it is considered a distance vector protocol. The process RIP uses to find a set of loop-free paths through the network is considered a distributed form of the Bellman-Ford algorithm, but it is not obvious how the process RIP uses is related to Bellman-Ford.
To see the connection, it is best to think of each hop in the network as a single row in the topology table; this is illustrated in Figure 15-8.
Chapter 12, “Unicast Loop-Free Paths (1),” describes Bellman-Ford operating across a topology table, arranged as a set of columns and rows. Using the row numbers indicated in Figure 15-8, you can build a similar topology table for this network, as shown in Table 15-1.
Table 15-1 A Topology Table Built from the Network in Figure 15-8

| Row | Source (s) | Destination (d) | Distance (cost) |
|-----|------------|-----------------|-----------------|
| 1   | 100::/64   | A               | 1               |
| 2   | A          | B               | 1               |
| 3   | B          | C               | 2               |
| 4   | C          | D               | 2               |
Assume each row of the table is run through the Bellman-Ford algorithm by a different node. For instance, A computes Bellman-Ford across the first row and passes the result on to B. Likewise, B computes Bellman-Ford across the relevant rows and passes the result on to C. Bellman-Ford would still be the algorithm used to compute the set of loop-free paths through the network; it would simply be distributed across the nodes in the network. This, in fact, is how RIP operates. Consider the following:
1. A computes the first row in the table, setting the predecessor for 100::/64 to A and the cost to 1. A passes this result on to B for the second round of processing.
2. B processes the second row in the table, setting the predecessor for 100::/64 to B and the cost to 2. B passes this result on to C for the third round of processing.
3. C processes the third row in the table, setting the predecessor for 100::/64 to C and the cost to 4. C passes this result on to D.
The Bellman-Ford distributed processing is more difficult to see in more complex topologies, because there is more than one “result table” being passed around the network. These “result tables” will eventually merge at the source node, however. Figure 15-9 illustrates.
In Figure 15-9, A would compute a provisional result table as the first “round” of the Bellman-Ford algorithm, passing the result on to both B and E. B would compute a provisional result based on local information, passing this on to C, and then C to D. In the same way, E would compute a provisional result table based on local information, passing this on to F, and then F to D. At D, the two provisional results are combined into a final table from D’s perspective. Of course, the provisional table is considered final for the device at each hop. From E’s perspective, the table it computes based on locally available information plus the advertisement from A is the final table of loop-free paths to reach 100::/64.
The entire distributed process has the same effect as walking across every row of the topology table as many times as there are entries in the table, gradually settling the predecessor and cost fields of each entry based on the predecessors set in the previous round of computation.
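Under the assumption that each row of Table 15-1 is an edge with the listed distance, the round-by-round relaxation can be sketched as:

```python
def bellman_ford_rounds(edges, source, rounds):
    """One 'round' per table entry, as in the distributed description:
    each round relaxes every edge once, carrying the provisional
    (cost, predecessor) results forward to the next round."""
    dist = {source: 0}
    pred = {}
    for _ in range(rounds):
        for u, v, cost in edges:
            if u in dist and dist[u] + cost < dist.get(v, float("inf")):
                dist[v] = dist[u] + cost
                pred[v] = u
    return dist, pred

# The rows of Table 15-1 as edges: (source, destination, distance).
edges = [("100::/64", "A", 1), ("A", "B", 1), ("B", "C", 2), ("C", "D", 2)]
dist, pred = bellman_ford_rounds(edges, "100::/64", rounds=len(edges))
# D ends up reaching 100::/64 with a total cost of 1 + 1 + 2 + 2 = 6, via C.
```

In the distributed version each node runs only the relaxation for its own row and passes the provisional result along; the centralized loop above simply performs all the rows in one place.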
How does RIP remove reachability information from the network in the case of a node or link failure? Figure 15-10 is used to explain.
There are two different possible reactions to the loss of the [A,B] link, depending on the version and configuration of RIP running in this network. The first possible reaction is to simply let the information about 100::/64 time out. Assuming the invalid timer (a form of hold timer) for any given route is 180 seconds (a common setting in RIP implementations):
• B would notice the failed link immediately, as it is directly connected, and remove 100::/64 from its local routing table.
• B would stop advertising reachability to 100::/64 toward C.
• C will remove reachability to this destination from its local routing table and stop advertising reachability toward 100::/64 to D 180 seconds after B stops advertising reachability to 100::/64.
• D will remove reachability to this destination from its local routing table 180 seconds after C stops advertising reachability to 100::/64.
At this point, the network has converged on the new topology information. This is obviously a rather slow process, as each hop must wait for every router closer to the destination to time the destination out before discovering the loss of connectivity.
To speed up this process, most RIP implementations also include triggered updates. If triggered updates are implemented and deployed in this network, when the [A,B] link fails (or is removed from service), B will remove reachability to 100::/64 from its local table and send a triggered update to C, informing C of the failed reachability toward the destination. This triggered update generally takes the form of an advertisement with an infinite metric, or rather what is known as a poison reverse. Triggered updates are often paced, so a flapping link will not cause the triggered updates themselves to overwhelm either a link or a neighboring router.
Two other timers are specified in RIP for use during convergence: the flush timer and the hold-down timer. When a route times out (as described above), it is not immediately removed from the local routing table. Rather, another timer is set that determines when the route will be flushed from the local table. This is the flush timer. Further, there is a separate time period during which any route with a worse metric than the previously known metric will not be accepted. This is the hold-down timer.
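The interaction of these timers can be sketched as a simple route life cycle; the timer values below are modeled on common implementation defaults (30-second updates, 180-second invalid timer, 240-second flush timer), not taken from the text:

```python
UPDATE, INVALID, FLUSH = 30.0, 180.0, 240.0  # seconds; assumed defaults

def route_state(last_heard, now):
    """Sketch of the timer-driven life cycle of a RIP route: valid
    while updates keep arriving, invalid (advertised with an infinite
    metric) once the invalid timer expires, and removed entirely from
    the local table once the flush timer expires."""
    age = now - last_heard
    if age < INVALID:
        return "valid"
    if age < FLUSH:
        return "invalid"
    return "flushed"

state_fresh = route_state(0.0, 30.0)     # update just received: valid
state_stale = route_state(0.0, 200.0)    # invalid, awaiting flush
state_gone = route_state(0.0, 300.0)     # flushed from the table
```

A triggered update with a poison reverse (infinite metric) simply short-circuits this timeline, moving the route directly to the invalid state without waiting for the timers.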
RIP carries information about locally reachable destinations to neighbors, along with a cost for each destination; hence it is a distance vector protocol. Reachable destinations are learned through local information at each device, and carried through the network by the protocol regardless of traffic flow; hence RIP is a proactive control plane.
RIP does not form adjacencies for the reliable transmission of data through the network; rather, RIP relies on periodically transmitted updates to ensure information has not become out of date or has been accidentally dropped. The amount of time any piece of information is kept is based on a hold timer, and the frequency of transmissions is based on an update timer; the hold timer is normally set to three times the value of the update timer.
As RIP has no true adjacency process, it does not detect whether or not two-way connectivity exists; hence there is no Two-Way Connectivity Check (TWCC). No method to check the MTU between two neighbors is built into RIP, either.
The Enhanced Interior Gateway Routing Protocol (EIGRP) was originally released in 1993 to replace Cisco’s Interior Gateway Routing Protocol (IGRP). The primary reason for replacing IGRP was its inability to carry classless routing information; specifically, IGRP could not carry subnet masks. Rather than rebuild the protocol to support prefix lengths, engineers at Cisco (specifically Dino Farinacci and Bob Albrightson) decided to build a new protocol based on Garcia-Luna’s Diffusing Update Algorithm (DUAL). Dave Katz rebuilt the transport to resolve some widely encountered problems in the mid-1990s. Based on this initial implementation, a team led by Donnie Savage modified the operation of the protocol heavily in the 2000s, adding a number of scaling features and rewriting key parts of EIGRP’s reaction to topology changes. EIGRP was released, along with virtually all of these enhancements, in the informational RFC7868 in 2016.
While EIGRP is not often considered for active deployment in most service provider networks (most operators prefer a link state protocol instead), DUAL introduces some important concepts into the conversation around loop-free paths. DUAL is also used in other protocols, such as BABEL (specified in RFC6126, and used in lightweight radio and home network environments).
Note
Throughout this discussion of EIGRP, the bandwidth of every link is assumed to be set to 1,000, and the K values set to their default values, leaving the delay as the only component impacting the metric. Given this, the delay value alone is used as the metric in these examples to simplify the math.
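This simplification can be checked against the classic (narrow) EIGRP composite metric. The formula and units below (bandwidth in Kbit/s, delay in tens of microseconds) follow the commonly published form of the metric rather than anything stated in this note, so they are an assumption:

```python
# A sketch of the classic (narrow) EIGRP composite metric with default
# K values (K1=1, K2=0, K3=1, K4=K5=0), under which only the minimum
# bandwidth along the path and the sum of link delays matter.
# Bandwidth is in Kbit/s; delay is in tens of microseconds (classic units).
def eigrp_metric(bandwidths, delays, k1=1, k3=1):
    bw_term = k1 * (10**7 // min(bandwidths))
    delay_term = k3 * sum(delays)
    # The K2/K4/K5 (load and reliability) terms vanish at their defaults.
    return 256 * (bw_term + delay_term)

# With every link's bandwidth fixed at 1,000 (as in the examples that
# follow), the bandwidth term is constant, so path comparisons reduce
# to comparing summed delay.
path_via_b = eigrp_metric([1000, 1000, 1000], [100, 100, 100])  # delay 300
path_via_c = eigrp_metric([1000, 1000, 1000], [100, 100, 200])  # delay 400
print(path_via_b < path_via_c)  # True: the lower-delay path wins
```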
Figure 15-12 is used to describe the operation of EIGRP.
The operation of EIGRP in this network is very simple on the surface:
1. A discovers 2001:db8:3e8:100::/64 because it is directly attached (this could be through the interface configuration, for instance).
2. A adds the inbound interface cost, here shown as a delay of 100, to the route, and installs it in its local routing table.
3. A advertises 100::/64 to B and C through the two other connected interfaces.
4. B receives this route, adds the inbound interface cost (for a total delay of 200), and examines its local table for any other (or better) routes to this destination; B does not have a route to 100::/64, so it installs the route in its local routing table.
5. B advertises 100::/64 to D.
6. C receives this route, adds the inbound interface cost (for a total delay of 200), and examines its local table for any other (or better) routes to this destination; C does not have a route to 100::/64, so it installs the route in its local routing table.
7. C advertises 100::/64 to D.
8. D receives the route to 100::/64 from B, adds the inbound interface cost (for a total delay of 300), and examines its local table for any other (or better) routes to this destination; D does not have a route to this destination, so it installs the route in its local routing table.
9. D receives the route to 100::/64 from C, adds the inbound interface cost (for a total delay of 400), and examines its table for any other (or better) routes to this destination; D does have a better route to 100::/64, through B, so it inserts the new route into its local topology table (see below for the additional processing D does on this alternate path).
10. D advertises the route to 100::/64 to E.
11. E receives the route to 100::/64 from D, adds the inbound interface cost (for a total delay of 400), and examines its local table for any other (or better) routes to this destination; E does not have a route to this destination, so it installs the route in its local routing table.
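The cost accumulation in these steps can be sketched in a few lines; the individual link delays (100 on each link, except 200 on the [C,D] link) are inferred from the totals given above:

```python
# A minimal sketch of the advertisement walk in steps 1-11: each router
# adds the inbound interface delay to the advertised cost before
# installing and re-advertising the route.
def receive(advertised_cost, inbound_delay):
    return advertised_cost + inbound_delay

a = receive(0, 100)          # step 2: A installs 100::/64 at delay 100
b = receive(a, 100)          # step 4: B's total is 200
c = receive(a, 100)          # step 6: C's total is 200
d_via_b = receive(b, 100)    # step 8: D's best path, total 300
d_via_c = receive(c, 200)    # step 9: the alternate path, total 400
e = receive(min(d_via_b, d_via_c), 100)  # step 11: E's total is 400
print(d_via_b, d_via_c, e)   # 300 400 400
```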
Thus far, this is very similar to the operation of RIP. Step 9, however, needs a good bit more detail. After step 8, D has a path to 100::/64 with a total cost of 300; this is the feasible distance to the destination, and B is the successor, as it offers the lowest-cost path. At step 9, D receives a second path to this same destination. In RIP, or other Bellman-Ford implementations, this second path would either be ignored or discarded. EIGRP, being grounded in DUAL, however, will examine this second path to determine whether it is loop free. Can this path be used if the primary path fails?
To determine whether this alternate path is loop free or not, D must compare the feasible distance with the distance C has reported as its cost to reach 100::/64—the reported distance. This information is available in the advertisement D receives from C (remember that C advertises the route with its cost to the destination; D adds the cost of the [C,D] link to this to find the total cost through C to reach 100::/64). The reported distance through C is 200, which is less than the local feasible distance of 300. Hence, the route through C is loop free and is marked as a feasible successor.
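The feasibility condition itself is a one-line comparison; a minimal sketch:

```python
# A sketch of the DUAL feasibility check described above: an alternate
# path is loop free (a feasible successor) when the neighbor's reported
# distance is strictly less than the local feasible distance.
def is_feasible_successor(reported_distance, feasible_distance):
    return reported_distance < feasible_distance

# D's view of 100::/64: feasible distance 300 via B (the successor).
# C reports a distance of 200, so the path through C is a feasible successor.
print(is_feasible_successor(200, 300))  # True

# C's view: feasible distance 200 via A. D reports 300, which fails the
# check, so C holds no feasible successor through D.
print(is_feasible_successor(300, 200))  # False
```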
How are these feasible successors used? Assume the [B,D] link fails, as illustrated in Figure 15-13.
When this link fails, D will examine its local topology table to discover if it has another loop-free path to the destination. Since the path through C is marked as a feasible successor, D does have an alternate path. In this case, D can simply switch to using the path through C to reach 100::/64. D will not recalculate the feasible distance in this case, as it has not received any new information about the network topology.
What if the link between C and A fails, instead, as illustrated in Figure 15-14?
Figure 15-14 Reacting to Failure Without a Feasible Successor in EIGRP
In this case, before the failure, C has two paths to 100::/64: one through A with a total delay of 200 and a second through D with a total delay of 500. The feasible distance at C will be set to 200, as this is the cost of the best path available when convergence is complete. The reported distance at D, 300, is greater than the feasible distance at C, so C will not mark the path through D as a feasible successor. Once the [A,C] link fails, since C does not have an alternate path, it will mark the route active and send a query to each of its neighbors requesting updated information about any available path to 100::/64.
When D receives this query, it will examine its local topology table and find that its best path toward 100::/64 is still available. Because this path still exists, the EIGRP process on D can assume that the current best path, through B, has not been impacted by the failure of the [A,C] link. D replies to this query with its current metric, which indicates this path is still available, and is loop free from D’s perspective.
On receiving this reply, C will note it is not waiting on any other neighbors to respond (as it has just one neighbor, D). As C has received all the replies it is waiting on, it will recalculate the available loop-free paths, choosing D as the successor, and the cost through D as the feasible distance.
What happens if D never responds to C’s query? In older EIGRP implementations, C would set a timer, called the Stuck in Active Timer; if D does not respond to C’s query within this time, C will declare the route Stuck in Active (SIA) and reset its neighbor adjacency with D. In newer implementations of EIGRP, C will set a timer called the SIA Query timer. When this timer expires, it will resend the query to D. So long as D responds that it is still working on answering the query, C will continue to wait for a response.
Where do these queries terminate? How far will an EIGRP query propagate in a network? EIGRP queries terminate at one of two points:
• When a router has no other neighbors to send queries to
• When the router receiving the query does not have any information about the destination referenced by the query
This means either at the “end of the EIGRP network” (called an Autonomous System), or one router beyond any sort of policy or configuration that hides information about specific destinations; for instance, one hop beyond the point where a route is aggregated.
EIGRP checks for two-way connectivity between neighbors, verifies the link MTU, and provides for the reliable transport of control plane information through the network by forming neighbor relationships. Figure 15-15 illustrates the EIGRP neighbor formation process.
The steps illustrated in Figure 15-15 are as follows:
1. A sends a multicast hello onto the link shared between A and B.
2. B places A in pending state; while A is in pending state, B will not send standard updates or queries to A, nor will it accept anything other than a specially formatted update from A.
3. B transmits an empty update with the initialization bit set to A; this packet is sent to A’s unicast interface address.
4. On receiving this update, A responds with an empty update with the initialization bit set and containing an acknowledgment; this packet is sent to B’s unicast interface address.
5. On receiving this unicast update, B places A into the connected state and begins sending updates containing individual topology table entries toward A; piggy-backed onto each packet is an acknowledgment for the previous packet received from the neighbor.
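From B’s perspective, steps 1-5 amount to a small state machine. In the sketch below, the state names (“pending,” “connected”) follow the text, while the event names are illustrative assumptions, not taken from any implementation:

```python
# A rough sketch of the neighbor-state bookkeeping in steps 1-5 above,
# from B's perspective. Unknown events leave the state unchanged, which
# mirrors B ignoring anything but a specially formatted update while
# the neighbor is pending.
def next_state(state, event):
    transitions = {
        ("down", "multicast_hello_received"): "pending",
        ("pending", "init_update_acked"): "connected",
    }
    return transitions.get((state, event), state)

state = "down"
state = next_state(state, "multicast_hello_received")  # step 2
print(state)  # pending: only specially formatted updates accepted
state = next_state(state, "init_update_acked")         # steps 3-5
print(state)  # connected: full topology-table updates begin
```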
Because EIGRP does not form adjacencies with sets of neighbors, only individual neighbors, this process ensures that both unicast and multicast reachability are available between the two routers forming an adjacency. To ensure that the MTU is not mismatched on either end of the link, EIGRP pads a specific set of packets during the neighbor formation; if these packets are not received by the other router, the MTU is mismatched, and no neighbor relationship should be formed.
Note
EIGRP sends multicast hellos for neighbor discovery by default but will use unicast hellos if neighbors are manually configured.
EIGRP presents a number of interesting solutions to the problems that routing protocols encounter when sending information across a network, calculating loop-free paths, and reacting to topology changes. EIGRP is classified as a distance vector protocol using DUAL to calculate loop-free paths, and alternate loop-free paths, through the network. EIGRP advertises routes without reference to traffic flows through the network, so it is a proactive protocol.
Bellman, Richard. “On a Routing Problem.” Quarterly of Applied Mathematics 16 (1958): 87–90.
Dijkstra, E. W. “A Note on Two Problems in Connexion with Graphs.” Numerische Mathematik 1, no. 1 (1959): 269–71. doi:10.1007/BF01386390.
Enyedi, Gabor Sandor, Andras Csaszar, Alia Atlas, Chris Bowers, and Abishek Gopalan. An Algorithm for Computing IP/LDP Fast Reroute Using Maximally Redundant Trees (MRT-FRR). Request for Comments 7811. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7811.txt.
“eRSTP—Enhanced Rapid Spanning Tree Protocol—Industrial Communication—Siemens.” WCMS3 Article. Accessed September 25, 2017. http://w3.siemens.com/mcms/industrial-communication/en/rugged-communication/technology-highlights/pages/erstp-enhance-rapid-spanning-tree-protocol.aspx.
Ford, L. R. Network Flow Theory. Santa Monica, CA: RAND Corporation, 1956.
Garcia-Luna-Aceves, J. J. “Loop-Free Routing Using Diffusing Computations.” IEEE/ACM Transactions on Networking 1, no. 1 (February 1993): 130–41.
Hedrick, C. Routing Information Protocol. Request for Comments 1058. RFC Editor, 1988. https://rfc-editor.org/rfc/rfc1058.txt.
Malkin, Gary S. RIP Version 2. Request for Comments 2453. RFC Editor, 1998. https://rfc-editor.org/rfc/rfc2453.txt.
Malkin, Gary S., and Robert E. Minnear. RIPng for IPv6. Request for Comments 2080. RFC Editor, 1997. https://rfc-editor.org/rfc/rfc2080.txt.
Moore, Edward F. “The Shortest Path through a Maze.” In Proceedings of the International Symposium on Switching Theory 1957, Part II. Cambridge, MA: Harvard University Press, 1959.
Perlman, Radia. “An Algorithm for Distributed Computation of a Spanningtree in an Extended LAN.” SIGCOMM Computer Communication Review 15, no. 4 (September 1985): 44–53, doi:10.1145/318951.319004.
Retana, Alvaro, Russ White, and Don Slice. EIGRP for IP: Basic Operation and Configuration. 1st edition. Boston, MA: Addison-Wesley Professional, 2000.
Savage, Donnie, Steven Moore, James Ng, Russ White, Donald Slice, and Peter Paluch. Cisco’s Enhanced Interior Gateway Routing Protocol (EIGRP). Request for Comments 7868. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7868.txt.
Schrijver, Alexander. “On the History of the Shortest Path Problem.” Documenta Mathematica Extra (2012): 155–67.
Shimbel, A. “Structure in Communication Nets.” In Proceedings of the Symposium on Information Networks, 199–203. New York: Polytechnic Press of the Polytechnic Institute of Brooklyn, n.d.
Suurballe, J. W. “Disjoint Paths in a Network.” Networks 4, no. 2 (1974): 125–45. doi:10.1002/net.3230040204.
“Understanding Rapid Spanning Tree Protocol (802.1w).” Cisco. Accessed September 25, 2017. https://www.cisco.com/c/en/us/support/docs/lan-switching/spanning-tree-protocol/24062-146.html.
1. Go through some specific protocols, describing each in terms of:
• How the protocol would be broadly classified
• Which problems they address out of the set described in the book
• Which solutions they chose for each problem
• Which problems the protocol does not address
What does this tell you about the convergence characteristics of the protocol?
Is there a particular use case for each protocol?
How does each one overlap with/differ from the ones described in the book?
• AODV
• TRILL
• BABEL
• OLSR
2. What is the specific method used by STP to prevent loops after a topology change has occurred? Can you relate this to the CAP theorem?
3. The text only describes the handling of unicast packets flowing through an STP domain. How are broadcasts and multicasts handled? What is a broadcast storm, and why is it so dangerous in a network running STP?
4. Examine STP from a complexity perspective. What simplifying assumptions are made, and how do these simplifying assumptions impact the optimization of using network resources?
5. Research the operation of the Rapid Spanning Tree Protocol (RSTP; see the “Further Reading” section for resources). What are the advantages of RSTP over the Spanning Tree Protocol?
6. Consider the VLAN extensions to STP. What are these extensions? Do they make the protocol more complex or less? Do they increase or decrease the optimal use of network resources?
7. Consider the RIP hold-down timer described in the text. Construct a network where RIP could potentially form a loop if the implementation does not support the hold-down timer.
8. Analyze triggered RIP, as described in the text, within the complexity model of state/optimization/surface. Is there an additional interaction surface introduced into RIP by triggered updates? Is there additional state? What is the optimization tradeoff?
9. EIGRP can carry two different kinds of metrics—narrow and wide. Why do these two kinds of metrics exist? What is the relationship between them?
10. EIGRP can carry two different kinds of metrics—narrow and wide. Describe the narrow to wide transition mechanism. Is this effective? Are there any problems inherent in the process? What happens if one router is never upgraded?
11. Consider the EIGRP stuck-in-active process before the SIA query was inserted in the code. Describe the process. Construct a network where EIGRP will reset an adjacency several hops away from a router that is not answering queries without the SIA query.
1. Perlman, “An Algorithm for Distributed Computation of a Spanningtree in an Extended LAN.”
2. Hedrick, Routing Information Protocol.
3. Malkin, RIP Version 2.
4. Malkin and Minnear, RIPng for IPv6.
This chapter continues the discussion on distributed control planes, addressing three more routing protocols. Two of these are link state protocols, and the third is the only widely deployed path vector protocol, the Border Gateway Protocol (BGP) version 4.
Throughout this chapter, it is important to consider why each of these protocols is implemented the way it is. While it is always easy to become lost in the finer details of protocol operation, it is far more important to remember the problems these protocols were designed to address and the range of possible solutions. Each protocol you study will be some combination of a moderately restricted set of available solutions; there are very few truly new solutions, only different combinations of existing ones, implemented in sometimes unique ways to solve specific sets of problems.
When reading through these high-level overviews of protocol operation, you should try to pick out the common solutions they implement and then reflect these common solutions back into the set of problems any distributed control plane must solve in order to succeed in real networks.
The Intermediate System to Intermediate System (IS-IS, or IS to IS) protocol is unique among the routing protocols in several ways. The work on IS-IS began in 1978, with the acceptance of the seven-layer networking model proposed by Honeywell Labs to the British Standards Institute, which then proposed the idea of forming a working group within the International Organization for Standardization (ISO) to standardize the communications between computers. The idea was so good that the forerunner of the International Telecommunication Union (the ITU) formed a parallel working group to work with the ISO in building these standards. These committees, their subcommittees, and sub-subcommittees, ad infinitum, created a suite of standard protocols. Among these protocols was IS-IS.
Open Shortest Path First (OSPF) was originally conceived as an alternative to IS-IS, designed specifically to interact with IPv4 networks. In 1989, the first OSPF specification was published by the Internet Engineering Task Force, and OSPFv2, a much improved specification, was published in 1998 as RFC2328. OSPF was certainly the more widely used protocol, with early implementations of IS-IS being barely exercised in the real world. There were some back-and-forth arguments, and many features were “stolen” from one protocol into the other (in both directions).
In 1993, Novell, a heavyweight in the networking world at the time, used IS-IS as the basis for a replacement to the Netware native routing protocol. Novell’s transport layer, Internetwork Packet Exchange (IPX), ran on a large number of devices at the time, and the ability for a single protocol to route multiple transport protocols was a definitive advantage in the networking market (the Enhanced Interior Gateway Routing Protocol, or EIGRP, can also route IPX). This replacement protocol was based on IS-IS; to implement Novell’s new protocol, many vendors simply rewrote their implementations of IS-IS, greatly improving them in the process. This rewrite made IS-IS attractive to large-scale Internet service providers, so as they moved off the Routing Information Protocol (RIP), they would often move onto IS-IS instead of OSPF.
Note
Parts of this history rely on Dave Katz’s presentation at the North American Network Operators’ Group (NANOG) in the summer of 2000.1 Other parts rely on the history given in IS-IS: Deployment in IP Networks.2
In the Intermediate System to Intermediate System (IS-IS) protocol, a router is called an Intermediate System (IS), and a host is called an End System (ES). The original design of the suite was for each device, rather than interface, to have a single address. Services and interfaces on a device would then have a Network Service Access Point (NSAP), used to direct traffic to a specific service or interface. From an IP perspective, then, IS-IS was originally designed within a host routing paradigm; Intermediate and End Systems communicated directly using the End System to Intermediate System (ES-IS) protocol, allowing IS-IS to discover the services available on any connected End System, as well as to match lower interface addresses with higher layer device addresses.
Another interesting aspect of the design of IS-IS is it runs at the link layer; it did not make a lot of sense to the designers of the protocol to run the control plane to provide reachability for a transport system over the transport system itself. Routers will not forward IS-IS packets, as they are parallel to IP in the protocol stack and transmitted to link local addresses. When IS-IS was developed, most links were very low speed, so the extra encapsulation was also thought to be wasteful. Links also failed quite often, losing and corrupting packets; hence the protocol was designed to withstand errors in transmission and packet loss.
As IS-IS was developed for a different transport protocol suite, it does not use Internet Protocol (IP) addresses to identify devices. Instead, it uses an Open Systems Interconnect (OSI) address to identify both Intermediate and End Systems. The OSI addressing scheme is somewhat complex, including identifiers for the authority allocating the address space, a two-part domain identifier, an area identifier, a system identifier, and a service selector (the N Selector); many of these parts of the OSI address are variable length, making the system even more difficult to understand. Within the IP world, however, only three parts of this address space are used.
• The Authority Format Identifier (AFI), Initial Domain Identifier (IDI), High-Order Domain Specific Part (HO-DSP), and the area are all treated as a single field called the area.
• The System Identifier is still treated as the system identifier.
• The N Selector, or NSAP, is generally ignored (although there is an interface identifier that is similar to the NSAP used in some specific situations).
Intermediate system addresses, then, normally take the form illustrated in Figure 16-1.
In Figure 16-1:
• The dividing point between the system identifier and the remainder of the address is at the sixth octet, or twelve hexadecimal digits from the right side; everything to the left of the sixth octet is considered part of the area address.
• If the N Selector is included, it is a single octet, or two hexadecimal digits, to the right of the system identifier; for instance, if an N Selector were included for address A, it might be 49.0011.2222.0000.0000.000A.00.
• If an N Selector is included in the address, you need to skip the N Selector when counting over six octets to find the start of the area address.
• A and B are in the same flooding domain because they share the same digits from the seventh octet to the leftmost octet in the address.
• C and D are in the same flooding domain.
• A and D represent different systems, although their system identifier is the same; this sort of addressing, however, can be very confusing, and so is not used in real IS-IS deployments (at least not by thoughtful system administrators).
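Splitting an address this way is mechanical; below is a minimal sketch using the count-from-the-right rule above. The dotted grouping is cosmetic and carries no structural meaning, so it is simply stripped:

```python
# A sketch of splitting an OSI address as described above: the rightmost
# six octets (twelve hex digits) are the system identifier, everything
# to their left is treated as the area, and an optional trailing octet
# is the N Selector.
def split_osi_address(address, has_nsel=False):
    digits = address.replace(".", "")
    nsel = None
    if has_nsel:
        digits, nsel = digits[:-2], digits[-2:]
    area, system_id = digits[:-12], digits[-12:]
    return area, system_id, nsel

print(split_osi_address("49.0011.2222.0000.0000.000A"))
# ('4900112222', '00000000000A', None)
print(split_osi_address("49.0011.2222.0000.0000.000A.00", has_nsel=True))
# ('4900112222', '00000000000A', '00')
```

Note that the N Selector flag must be supplied by the caller: as the text warns, the raw digit string alone does not tell you whether a trailing octet is an NSEL or part of the system identifier.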
You may find this addressing scheme more difficult than IP to work with, even if you work with IS-IS as a routing protocol on a regular basis. There is a major advantage to using an addressing scheme that is different from the one being used at the transport level in a network, however; it is much easier to differentiate between the kinds of devices on the network, and it is much easier to separate nodes from destinations when thinking through Dijkstra’s Shortest Path First (SPF) algorithm.
IS-IS uses a fairly interesting mechanism to marshal data for transmission between intermediate systems. Each IS generates three kinds of packets:
• Hello packets
• Sequence Number Packets (Partial, PSNPs; and Complete, CSNPs)
• A single Link State Packet (LSP)
The single LSP contains all the information about the IS itself, any reachable intermediate systems, and any reachable destinations attached to the IS. This single LSP is formatted into Type Length Values (TLVs), which contain various bits of information. Some of the more common TLVs include the following:
• Types 2 and 22: Reachability to another intermediate system
• Types 128, 135, and 235: Reachability to an IPv4 destination
• Types 236 and 237: Reachability to an IPv6 destination
There are multiple types because IS-IS originally supported 6-bit metrics (most processors at the time of the protocol’s definition could hold only 8 bits at a time, and two bits were “stolen” from this field size to carry information about whether the route was internal or external as well as other information). Over time, as link speeds increased, various other metric lengths were introduced, including 24- and 32-bit metrics, to support wide metrics.
The single LSP carrying all IS, IPv4, and IPv6 reachability information—as well as, potentially, MPLS tags and other information—will not fit into a single MTU-sized packet. To actually send information over the network, IS-IS breaks up the LSP into fragments. Each fragment is treated as a separate entity in the flooding process. If one fragment changes, just the changed fragment is flooded through the network, rather than the entire LSP. Because of this scheme, IS-IS is very efficient at flooding new topology and reachability information without using more than the minimum amount of bandwidth required.
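The fragmentation behavior can be modeled roughly as packing whole TLVs into MTU-sized fragments, each of which then floods independently. This is a sketch of the idea only (the header size and TLV sizes are illustrative, not the actual wire format):

```python
def fragment_lsp(tlvs, mtu, header_size=27):
    """Pack a list of TLV byte-lengths into LSP fragments.

    Each fragment carries the fixed LSP header plus as many whole
    TLVs as fit under the MTU; a TLV is never split across fragments.
    Returns a list of fragments, each a list of TLV lengths.
    """
    fragments, current, used = [], [], header_size
    for tlv_len in tlvs:
        if used + tlv_len > mtu and current:
            fragments.append(current)       # this fragment is full
            current, used = [], header_size
        current.append(tlv_len)
        used += tlv_len
    if current:
        fragments.append(current)
    return fragments

# Ten 400-octet TLVs packed into 1497-octet fragments:
frags = fragment_lsp([400] * 10, mtu=1497)
print(len(frags))  # 4 fragments; only a changed fragment is ever reflooded
```

Because each fragment is a separate flooding unit, adding one new reachability TLV typically changes (and refloods) only the fragment it lands in.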
While IS-IS was originally designed to learn about network reachability through the ES-IS protocol, when IS-IS is used to route IP, it “does as the IP protocols do,” and learns about reachable destinations through the local configuration of each device, and through redistribution from other routing protocols. Hence IS-IS is a proactive protocol, learning about and advertising reachability without waiting on packets to be transmitted and forwarded through the network.
Neighbor formation in IS-IS is fairly simple; Figure 16-2 illustrates the process.
In Figure 16-2:
1. IS A transmits a hello toward B. This hello contains a list of neighbors heard from, which will be empty; the hold time setting B should use for A; and it is padded to the local interface Maximum Transmission Unit (MTU) for the link. Hello packets are padded only until the adjacency formation process is complete; not every hello packet is padded to the full MTU of the link.
2. IS B transmits a hello toward A. This hello contains a list of neighbors heard from, which would include A; the hold time setting A should use for B; and it is padded to the local interface MTU.
3. Because A is in B’s “heard neighbor” list, A will consider B up and move to the next stage of neighbor formation.
4. Once A has included B in the “heard neighbor” list in at least one hello, B will consider A up and move to the next stage of neighbor formation.
5. B will send a complete list of all the entries it has in its local topology table (B describes the LSPs it has in its local database). This list is sent in a Complete Sequence Number Packet (CSNP).
6. A will examine its local topology table, comparing it to the complete list sent by B; any topology table entries (LSPs) it does not have, it will request from B using a Partial Sequence Number Packet (PSNP).
7. When B receives a PSNP, it will set the Send Route Message (SRM) flag on any entry in its local topology table (LSPs) A has requested.
8. The flooding process will later walk the local topology table looking for entries with the SRM flag set; it will flood these entries, synchronizing the databases at A and B.
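Steps 5 through 8 amount to a set comparison between the two databases. The following sketch models that exchange (the dictionary structure and identifiers are illustrative, not the protocol's actual encoding):

```python
def synchronize(a_db, b_db):
    """Model the CSNP/PSNP exchange on a point-to-point link.

    Databases are dicts of {lsp_id: sequence_number}. B describes its
    database (the CSNP); A requests anything missing or older (the
    PSNP); B sets SRM on those entries, and the flooding process
    later sends them, synchronizing A with B.
    """
    csnp = dict(b_db)                        # B's complete description
    psnp = [lsp_id for lsp_id, seq in csnp.items()
            if a_db.get(lsp_id, -1) < seq]   # A's requests
    srm_flagged = set(psnp)                  # B sets SRM on requested entries
    for lsp_id in srm_flagged:               # flooding walks the table
        a_db[lsp_id] = b_db[lsp_id]
    return sorted(srm_flagged)

a = {"1921.6800.1001.00-00": 4}
b = {"1921.6800.1001.00-00": 5, "1921.6800.2002.00-00": 1}
print(synchronize(a, b))   # both entries flood to A
print(a == b)              # True: the databases are now synchronized
```

The real protocol works with per-fragment sequence numbers, lifetimes, and checksums, but the shape of the exchange — describe, request, flood — is the same.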
Note
The process described here includes modifications made by RFC5303, which specifies a three-way handshake, and hello padding, which was added to most implementations around 2005.
Setting the SRM flag marks the information for flooding, but how does flooding actually take place?
For Dijkstra’s SPF algorithm (or any other SPF algorithm) to work correctly, every IS in the flooding domain must share a synchronized database. Any inconsistency in the database between two intermediate systems opens the possibility of a permanent routing loop. How does IS-IS ensure connected intermediate systems have synchronized databases? This section describes the process on point-to-point links; the following section will describe the modifications made to the flooding process on multiaccess (such as Ethernet) links.
IS-IS relies on a number of fields in the LSP header to ensure two intermediate systems have synchronized databases; Figure 16-3 illustrates these fields.
In Figure 16-3:
• The packet length contains the total length of the packet in octets. For instance, if this field contains 15, the packet is 15 octets in length. The packet length field is 2 octets, so it can describe a packet up to 65,535 octets long—longer than even the largest link MTUs.
• The remaining lifetime field is also two octets and contains the number of seconds for which this LSP is valid. This forces the information carried in the LSP to be refreshed occasionally, an important consideration on older transmission technologies, where bits can be flipped, packets can be truncated, or information carried through the link can otherwise be corrupted. The advantage of having a timer that counts down, rather than up, is each IS in the network can determine how long its information should remain valid independently of every other IS. The disadvantage is there is no clear way to disable the functionality described. However, 65,535 seconds is a long time—1,092 minutes, or around 18 hours. Reflooding every LSP fragment in the network every 18 hours or so poses very little burden on the operation of the network.
• The LSP ID describes the LSP itself. Actually, this field describes the fragment, as it contains the originating system identifier, the pseudonode identifier (the function of this identifier is described later), and the LSP number, or rather the LSP fragment number. The information contained in a single LSP fragment is treated as “one unit” throughout the entire network; a single LSP fragment is never “refragmented” by some other IS. This field is normally 8 octets.
• The Sequence Number describes the version of this LSP. The sequence number ensures every IS in the network has the same information in its local copy of the topology table. It also ensures an attacker (or broken implementation) cannot replay older information to replace new.
• The Checksum ensures the information carried in the LSP fragment has not been modified during transmission.
Note
The term LSP is often used for two different things: the complete LSP describing all the connectivity and other information about a particular IS, and each fragment of the LSP as it is transmitted through the network. Hence, an LSP is split into LSPs, each of which is transmitted through the network. This can be confusing; this book will always call the LSP as it is transmitted the LSP fragment or fragment, and the LSP as generated by the IS, describing its entire connectivity, simply an LSP.
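Taken together, the header fields described above let any IS decide which of two copies of the same fragment is newer. A hedged sketch of that comparison (simplified; the full tie-breaking rules in the specification also involve the checksum):

```python
from dataclasses import dataclass

@dataclass
class LspHeader:
    lsp_id: str               # system identifier + pseudonode + fragment number
    sequence_number: int      # version of this fragment
    remaining_lifetime: int   # seconds until this copy expires

def newer(a: LspHeader, b: LspHeader) -> LspHeader:
    """Return the copy of a fragment that should be kept and flooded.

    A higher sequence number wins; with equal sequence numbers, a
    zero remaining lifetime (a flush) is treated as newer.
    """
    assert a.lsp_id == b.lsp_id, "only copies of the same fragment compare"
    if a.sequence_number != b.sequence_number:
        return a if a.sequence_number > b.sequence_number else b
    if a.remaining_lifetime == 0 or b.remaining_lifetime == 0:
        return a if a.remaining_lifetime == 0 else b
    return a   # identical for flooding purposes

old = LspHeader("0000.0000.000A.00-00", sequence_number=7, remaining_lifetime=900)
new = LspHeader("0000.0000.000A.00-00", sequence_number=8, remaining_lifetime=1200)
print(newer(old, new).sequence_number)  # 8: a replayed older copy loses
```

This is why a replayed or stale fragment cannot displace current information: the sequence number comparison rejects it at every IS.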
Flooding is described using Figure 16-4.
In Figure 16-4:
1. A is connected to 2001:db8:3e8:100::/64. A builds a new fragment describing this newly reachable destination.
2. A sets the SRM flag on this fragment toward B.
3. The flooding process, at some point in the future (usually a matter of milliseconds), will examine the topology table and flood any entries with the SRM flag set.
4. Once the new entry is placed in its topology table, B will create a CSNP describing its entire database and send this to A.
5. On receiving this CSNP, A clears its SRM flag toward B.
6. B verifies the checksum and compares the received fragment to existing entries in its topology table. As there is no other entry matching this system and fragment identifier, it will place the new fragment in its local topology table. Given this is a new fragment, B will initiate the flooding process toward C.
What about removing information? There are three ways information can be removed from the IS-IS flooding system:
• The originating IS can originate a new fragment without the relevant information and with a higher sequence number.
• If the entire fragment no longer contains any valid information, the originating IS can flood the fragment with a remaining lifetime of 0 seconds. This causes each IS in the flooding domain to reflood the zero age fragment and remove it from consideration for future SPF calculations.
• If the remaining lifetime timer in a fragment times out at any IS, the fragment is flooded with a zero age remaining lifetime. Each IS receiving this zero-aged fragment will verify it is the most recent copy of the fragment (based on the sequence number), set the remaining lifetime of its local copy of the fragment to zero seconds, and reflood the fragment. This is called flushing a fragment from the network.
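The third removal path, flushing, can be sketched as each IS aging its local copies and reflooding anything that expires. This is purely illustrative (the data structure and timer granularity are inventions for the sketch):

```python
def age_and_flush(database, elapsed_seconds):
    """Age every fragment; return the IDs that must be reflooded as flushes.

    database maps lsp_id -> remaining_lifetime in seconds. A fragment
    whose lifetime reaches zero is kept briefly as a zero-aged copy
    and reflooded, so every IS removes it from consideration in
    future SPF calculations.
    """
    flushed = []
    for lsp_id in list(database):
        database[lsp_id] = max(0, database[lsp_id] - elapsed_seconds)
        if database[lsp_id] == 0:
            flushed.append(lsp_id)   # reflood with remaining lifetime 0
    return flushed

db = {"frag-A": 30, "frag-B": 1200}
print(age_and_flush(db, 60))   # ['frag-A'] is flushed from the network
```

Because every IS ages its own copy independently, a fragment whose originator disappears is eventually flushed everywhere without any coordination.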
When an IS sends a CSNP in reply to a fragment it has received, it actually verifies the entire database, rather than just the one fragment it received. Each time a fragment is flooded through the network, the entire database is checked between each pair of intermediate systems.
IS-IS can be described as
• Using flooding to synchronize the database at every intermediate system in the flooding domain (a link state protocol).
• Calculating loop-free paths using Dijkstra’s SPF algorithm.
• Learning about reachable destinations through configuration and local information (a proactive protocol).
• Validating two-way connectivity in neighbor formation by carrying a list of “neighbors seen” in its hello packets.
• Removing information from the flooding domain through a combination of sequence numbers and remaining lifetime fields in each fragment.
• Verifying the MTU of each link by padding the initially exchanged hello packets.
• Validating the correctness of the information in the synchronized database through checksums, periodic reflooding, and database descriptions exchanged between intermediate systems.
IS-IS is a widely deployed routing protocol that has proven capable in a wide range of network topologies and operational requirements.
A version of OSPF for routing IPv6, known as OSPFv3, was originally specified in RFC2740, which was later replaced by RFC5340 and updated by later standards. OSPFv3 is the version assumed for any specific details of OSPF operation in this chapter.
Like many of the other protocols developed in the early days of network engineering, OSPF was designed to minimize the processing power, memory, and bandwidth required to carry routing information for IPv4 through the network. Two specific choices made early on in the OSPF design process reflect this concern with resource utilization:
• OSPF relies on fixed length fields to marshal data, rather than TLVs. This saves the overhead of carrying the additional metadata in the form of Type Length Value (TLV) headers, reduces processing requirements by allowing fixed sized in-memory data structures to be matched with packets as they are received off the wire, and reduces the size of OSPF data on the wire.
• OSPF breaks the topology database up into multiple kinds of data, rather than relying on a single LSP with TLVs. This means each kind of information—reachability, topology, etc.—is carried in a unique packet format.
Note
More recent work in OSPFv3 replaces specific fields in the current fixed field LSAs with TLV-based LSAs. See OSPFv3 LSA Extendibility for more information.3
Each type of information OSPF can carry is carried in a different type of Link State Advertisement (LSA). Some of the more notable types of LSAs are as follows:
• Type 1: code 0x2001, Router LSA
• Type 2: code 0x2002, Network LSA
• Type 3: code 0x2003, Inter-Area Prefix LSA
• Type 4: code 0x2004, Inter-Area Router LSA
• Type 5: code 0x4005, AS-external LSA
• Type 7: code 0x2007, Type-7 (NSSA) LSA
There are a number of other types of LSAs, including opaque data, multicast group membership, and scoped flooding LSAs (such as to a single neighbor, a single link, or a single flooding domain).
Each OSPF router generates precisely one Router LSA (type 1); this LSA describes any neighbors adjacent to the advertising router, as well as any connected reachable destinations. The state of the links to these neighbors and destinations is inferred from the advertisement of the neighbors and destination; in spite of the name “link state,” links are not advertised as a separate “thing” (this is often a point of confusion). If the Router LSA becomes too large to fit within a single IP packet (because of the link MTU), it will be split into multiple IP fragments for transmission router to router. Each router reassembles the entire Router LSA before processing it locally and floods the entire Router LSA if it changes.
OSPF uses a few different packet types, as well—these are not the same as the LSA types. Rather, these can be thought of as different “services” within OSPF or, perhaps, as different “port numbers” running on top of User Datagram Protocol (UDP) or the Transmission Control Protocol (TCP).
• The hello is a type 1. These are used for neighbor discovery and liveness.
• The Database Descriptor (DBD) is a type 2. These are used to describe the local topology table.
• The Link State Request (LSR) is a type 3. These are used to request specific Link State Advertisements from an adjacent router.
• The Link State Update (LSU) is a type 4. These are used to carry the Link State Advertisements described in this section.
• The Link State Acknowledgment is a type 5. This is simply a list of LSA headers; any LSA listed in this packet is acknowledged as being received by the transmitting router.
As a link state protocol, OSPF must ensure every router within an area (a flooding domain) has the same database to calculate loop-free paths from. Any variation in the shared topology database can result in a routing loop that will last as long as the variation in the shared topology database exists. One purpose for OSPF neighbor formation, then, is to ensure the reliable flooding of topology information through the network. A second reason for OSPF neighbor formation is to discover the network topology, by determining which routers are adjacent to the local router. Figure 16-5 illustrates the OSPF neighbor formation process.
In Figure 16-5:
1. B sends a hello packet to A.
2. Since B’s hello contains an empty neighbors seen list, A places B into init state and adds B to its neighbors seen list.
3. A sends a hello with B in its neighbors seen list.
4. B receives A’s hello and sends a hello with A in its neighbors seen list.
5. A receives this hello; as A itself is in the neighbors seen list, A places B into the two-way state. This means that A has verified two-way connectivity exists between itself and B.
6. If a DR and BDR are being elected on this link (the function of the DR and BDR is considered in a moment), the election takes place after step 5. Once the election is completed, the DR and BDR are placed in the exstart state. During this state, the master and slave are elected for the exchange of DBDs and LSAs. Essentially, the master controls the flow of DBDs and LSAs between the newly adjacent routers. Adjacent routers on a point-to-point link technically skip directly to full state at this point.
7. B is moved to the exchange state.
8. A sends a set of DBDs describing its database to B; B sends a set of DBDs describing its database to A.
9. A sends a link state request to B for each LSA B describes that A does not have a copy of in its local topology table.
10. B sends an LSA for each Link State (LS) request from A.
11. Once the databases are synchronized, B is moved to full state.
The OSPF neighbor formation process verifies the MTUs on both ends of the link match by carrying the MTU of the outbound interface in the hello; if the two hello packets do not match in MTU size, the two OSPF routers will not form an adjacency.
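This MTU check is simple enough to express directly. A sketch (the function name is an invention for illustration; note that RFC 2328 strictly carries the Interface MTU in the DBD packet, while the text above describes it in the hello):

```python
def accept_adjacency(local_mtu, received_mtu):
    """Reject the neighbor if the interface MTU carried in the
    received packet differs from the local interface MTU; a mismatch
    could otherwise silently drop large flooded LSUs later."""
    return local_mtu == received_mtu

print(accept_adjacency(1500, 1500))  # True
print(accept_adjacency(1500, 9000))  # False: no adjacency is formed
```

Catching the mismatch at adjacency formation is far cheaper than debugging a database that will not synchronize because large updates are being dropped.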
OSPF must not only ensure the initial exchange of topology information is completed, but it must also ensure ongoing changes in the network topology are flooded to every router in the flooding domain. Figure 16-6 illustrates the OSPF LSA header; examining this header will yield some important clues about the way OSPF reliably floods topology and reachability information through the network.
In Figure 16-6:
• The LS Age is (roughly) the number of seconds since this Link State Advertisement was generated. This number counts up, rather than down. When the LS Age reaches the MAXAGE setting (on any router, not just the originating router), the router will increment the sequence number by 1, set the LS Age to the maximum age, and reflood the LSA throughout the network. This removes older topology and reachability information that has not been refreshed in a while. The router that originates any particular LSA will refresh its LSAs some number of seconds before this LSA Age field reaches the maximum; this is the LS refresh interval.
• The Link State Identifier is a unique identifier assigned by the originating router to describe this LSA. It is normally the link address, or some local link layer address (such as an Ethernet Media Access Control, or MAC, address).
• The Advertising Router is the router ID of the originating router. This is often confused with an IP address, as it is often derived from a locally configured IP address—but it is not an IP address.
• The Link State Sequence Number indicates the version of the LSA. Generally, higher numbers mean newer versions, although there are earlier versions of OSPF that use a circular number space, rather than an absolutely incrementing one. Implementations that use an absolutely incrementing number space restart the OSPF process if the end of the number space is reached.
• The Link State Checksum is a checksum computed across the LSA used to catch errors in transmission or storage of the information.
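These header fields feed OSPF's "which copy is newer" decision whenever two copies of the same LSA meet. The following sketch follows the well-known ordering from RFC 2328 Section 13.1 (sequence number, then checksum, then age); the dict representation is an invention for illustration:

```python
MAX_AGE = 3600       # seconds; an LSA at MaxAge is being flushed
MAX_AGE_DIFF = 900   # seconds; ages closer than this are "the same"

def newer_lsa(a, b):
    """Compare two copies of the same LSA; return the newer one.

    Each LSA is a dict with 'seq', 'checksum', and 'age'. Returns
    None when the two copies are considered the same instance.
    """
    if a["seq"] != b["seq"]:
        return a if a["seq"] > b["seq"] else b
    if a["checksum"] != b["checksum"]:
        return a if a["checksum"] > b["checksum"] else b
    if (a["age"] == MAX_AGE) != (b["age"] == MAX_AGE):
        return a if a["age"] == MAX_AGE else b     # MaxAge copy is newer
    if abs(a["age"] - b["age"]) > MAX_AGE_DIFF:
        return a if a["age"] < b["age"] else b     # much younger copy wins
    return None

replayed = {"seq": 0x80000003, "checksum": 0x1D2F, "age": 100}
current  = {"seq": 0x80000007, "checksum": 0x9A41, "age": 1500}
print(newer_lsa(replayed, current) is current)  # True: higher sequence wins
```

As with IS-IS, the sequence number comes first, which is what prevents a replayed old LSA from displacing current information.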
Figure 16-7 is used to examine the flooding process.
In Figure 16-7:
1. The link to 2001:db8:3e8:100::/64 is configured, brought up, connected, etc., at A.
2. A rebuilds its Router LSA (type 1) to contain this new reachability information, packages it into an LSU (which may be fragmented while being placed into IP packets), and floods it to B.
3. B receives this LSA and acknowledges its receipt with a link state acknowledgment. A will resend the LSA if B does not acknowledge it quickly enough.
4. B will now examine its topology table to determine if this LSA is new or a copy of one it already has. B determines this primarily by examining a sequence number included in the LSA itself. If this is a new (or updated) LSA, B will initiate the same process to flood the changed LSA to C.
OSPF can be described as
• Learning about reachable destinations through configuration and local information (a proactive protocol)
• Using flooding to synchronize the database at every intermediate system in the flooding domain (a link state protocol)
• Calculating loop-free paths using Dijkstra’s SPF algorithm
• Validating two-way connectivity in neighbor formation by carrying a list of “neighbors seen” in its hello packets
• Validating the MTU at adjacency formation by carrying the MTU in the hello packet
OSPF is widely used in small- and large-scale networks, including retail, service provider, financial, and many other businesses.
The preceding sections have considered those aspects of OSPF and IS-IS that are different enough to warrant separate explanations. There are, however, a number of things OSPF and IS-IS have implemented in similar enough ways to consider their solutions as simple variants. These include the handling of multiaccess links, the way the Shortest Path Tree is conceptualized, and the way two-way connectivity checks are handled.
Multiaccess links, such as Ethernet, are links where attached devices “share” the available bandwidth, and each device can send packets directly to any other device connected to the same link. Multiaccess links pose special challenges for protocols that synchronize a database across the link; Figure 16-8 is used to explain.
One option a protocol could use when running over a multiaccess link is to simply form adjacencies as it normally would over a point-to-point link. For instance, in Figure 16-8:
• A can form an adjacency with B, C, and D.
• B can form an adjacency with A, C, and D.
• C can form an adjacency with A, B, and D.
• D can form an adjacency with A, B, and C.
If this pattern of adjacency formation is used, when A receives a new LSP fragment (IS-IS) or LSA (OSPF) from some router not connected to the shared link:
• A will transmit the new fragment or LSA to B, C, and D separately.
• When B receives the fragment or LSA, it will transmit the new fragment or LSA to C and D separately.
• When C receives the fragment or LSA, it will transmit the new fragment or LSA to D.
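The cost of this pairwise pattern grows quadratically with the number of attached routers. A back-of-the-envelope sketch of why a single elected flooder helps (the counts are illustrative and ignore acknowledgments and retransmissions):

```python
def pairwise_floods(n_routers):
    """One transmission per unordered pair: A sends to each of the
    others, B refloods to everyone except A, and so on — the 3 + 2 + 1
    pattern in the text for four routers."""
    return n_routers * (n_routers - 1) // 2

def elected_floods(n_routers):
    """With a DIS/DR: one multicast of the new fragment, plus the
    elected device's periodic multicast database summary."""
    return 2

for n in (4, 10, 40):
    print(n, pairwise_floods(n), elected_floods(n))
# at 40 routers: 780 pairwise transmissions versus a couple of multicasts
```

For the four routers in Figure 16-8 this is only 6 versus 2, but on a large LAN segment the difference is what makes the election worthwhile.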
考虑到每个片段或LSA的传输,以及随后的CSNP或确认以确保本地数据库在每个路由器处同步,大量数据包必须穿过共享链路以确保每个设备的数据库同步。为了减少多路访问链路上的泛洪,IS-IS 和 OSPF 选择一个设备来负责确保连接到该链路的每个设备都具有同步的数据库。图16-8中,对于IS-IS:
Given the transmission of each fragment or LSA, and the following CSNP or acknowledgment to ensure the local database is synchronized at each router, a large number of packets must cross the shared link to ensure every device’s database is synchronized. To reduce the flooding on multiaccess links, IS-IS and OSPF elect a single device that is responsible for ensuring every device connected to the link has a synchronized database. In Figure 16-8, for IS-IS:
• 选择单个设备来管理链路上的泛洪。在IS-IS中,该设备称为指定中间系统(DIS)。
• A single device is elected to manage flooding on the link. In IS-IS, this device is called the Designated Intermediate System (DIS).
• 每个具有新链路状态信息的设备都会将片段发送到多播地址,以便共享链路上的每个设备都会收到它。连接到该链路的设备在收到更新的片段时都不会发送任何类型的确认。
• Each device with new link state information sends the fragment to a multicast address so every device on the shared link will receive it. None of the devices connected to the link send acknowledgments of any kind when they receive the updated fragment.
• DIS 定期向同一多播地址发送其 CSNP 副本,因此多路访问链路上的每个设备都会收到其副本。
• The DIS sends out a copy of its CSNP on a regular basis to the same multicast address, so every device on the multiaccess link receives a copy of it.
• 如果共享链路上的任何设备发现它丢失了某些特定片段,则根据CSNP 中DIS 数据库的描述,它将向链路发送PSNP,请求丢失的信息。
• If any device on the shared link finds it is missing some specific fragment, based on the description of the DIS’s database in the CSNP, it will send a PSNP onto the link requesting the missing information.
• 如果共享链路上的任何设备发现它拥有DIS 没有的信息,则根据CSNP 中DIS 数据库的描述,它会将丢失的片段洪泛到该链路上。
• If any device on the shared link finds it has information the DIS does not have, based on the description of the DIS’s database in the CSNP, it will flood the missing fragment onto the link.
通过这种方式,新的链路状态信息在链路上传播的次数最少。图16-8中,对于OSPF:
In this way, new link state information is flooded across the link a minimal number of times. In Figure 16-8, for OSPF:
• 选择一个设备来管理链路上的泛洪,称为指定路由器(DR)。还选举出一个备份设备,称为备份指定路由器(BDR——创意,对吧?)。
• A single device is elected to manage flooding on the link, called the Designated Router (DR). A backup device is elected, as well, called the Backup Designated Router (BDR—creative, right?).
• 每个具有新链路状态信息的设备将其洪泛到由 DR 和 BDR(全 DR 路由器)监控的特殊组播地址。
• Each device with new link state information floods it to a special multicast address monitored by the DR and BDR (all-DR-routers).
• DR 接收此LSA,检查它以确定它是否包含新信息,然后将其重新洪泛到链路上所有OSPF 路由器(所有SPF 路由器)侦听的多播地址。
• The DR receives this LSA, examines it to determine if it contains new information, and then refloods it to a multicast address that all the OSPF routers on the link listen to (all-SPF-routers).
The election of a DIS or DR does not, however, just impact the flooding of information on the multiaccess link; it also impacts the way SPF is calculated through the link. Figure 16-9 illustrates.
In Figure 16-9, A is elected as the DIS or DR for the multiaccess circuit. A not only ensures every device on the link has a synchronized database, but it also creates a pseudonode, or p-node, and advertises it as if it were a real device attached to the network. Each of the routers connected to the shared link advertises connectivity to the p-node, rather than to each of the other connected systems.
In IS-IS, A creates an LSP for the p-node; this p-node advertises a zero-cost link back to each device attached to the multiaccess link. In OSPF, A creates a Network LSA (type 2).
Without this p-node, the network looks like a full mesh to the other intermediate systems in the flooding domain, as shown on the left side of Figure 16-9. With the p-node, the network appears to be a hub-and-spoke network, with the p-node as the hub. Each device advertises a link toward the p-node, with the link cost being set to the local interface cost onto the shared link. The p-node, in return, advertises a zero-cost link back to each device connected to the shared link. This reduces the complexity of calculating SPF across large-scale multiaccess links.
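A quick way to see the payoff is to count the advertised adjacencies with and without the pseudonode; the short sketch below simply compares the two counts for n routers on the shared link:

```python
# Advertised adjacency counts for n routers on a multiaccess link: a full
# mesh needs a directed adjacency from every router to every other router,
# while the p-node model needs one adjacency from each router to the p-node
# plus a zero-cost adjacency from the p-node back to each router.

def full_mesh_adjacencies(n):
    return n * (n - 1)

def pnode_adjacencies(n):
    return 2 * n

for n in (4, 10, 50):
    print(n, full_mesh_adjacencies(n), pnode_adjacencies(n))
# 4 12 8 / 10 90 20 / 50 2450 100: the savings grow quickly with link size
```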
One confusing aspect of link state protocols is how the nodes, links, and reachability interact with one another. Figure 16-10 illustrates.
In both OSPF and IS-IS, the nodes and links are used to build a Shortest Path Tree, as shown in the darker, solid lines. The dashed lines show how reachability information is attached to each node. Every node connected to a particular reachable destination advertises the destination—not just one of the two nodes connected to a point-to-point link, but both of them. Why is this?
The primary reason is that this is simply the easiest solution for advertising the reachable destinations. If you wanted to build a routing protocol that only advertised each reachable destination as connected to a single device, you would need to find some way to elect which of the connected devices should advertise the reachable destination. Further, if the elected device fails, then some other device must take over advertising the reachable destination, which can take time and negatively impact convergence. Finally, by allowing each device to advertise reachability to all connected destinations, you can actually find the shortest path to each destination.
That each device advertises each locally reachable destination is difficult for some engineers to wrap their minds around, however.
Two-way connectivity is a problem for control planes in two distinct places: between adjacent devices and when calculating loop-free paths through the network. Both IS-IS and OSPF also ensure two-way connectivity is in place when computing loop-free paths.
The essential element is a backlink check. Figure 16-11 illustrates.
In Figure 16-11, the direction of each link is labeled with an arrow (or set of arrows). The [A,B] link is unidirectional toward A; the remaining links are two-way connected (bidirectional). When computing SPF, D will do the following:
• When processing C’s link state information, note C claims to be connected to B. D will find B’s link state information and check to make certain B also claims to be connected to C. In this case, B does claim to be connected to C, so D will use the [B,C] link.
• When processing B’s link state information, note B claims to be connected to A. Examining A’s link state information, however, D cannot find any information from A claiming to be connected to B. Because of this, D will not use the [A,B] link.
This check is normally done either before a link is moved to the TENT or before a link is moved from the TENT onto the PATH.
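The backlink check itself is simple to express; here is a minimal sketch over a toy link state database, with the one-way [A,B] link from Figure 16-11 (the neighbor sets are invented to match the description above, not taken from any real LSDB encoding):

```python
# A minimal backlink (two-way connectivity) check over a toy link state
# database: {node: set of neighbors that node claims to be connected to}.

def backlink_ok(lsdb, u, v):
    """A link [u, v] is usable only if u claims v AND v claims u."""
    return v in lsdb.get(u, set()) and u in lsdb.get(v, set())

# From Figure 16-11: B claims a link to A, but A does not claim one to B.
lsdb = {
    "A": {"C"},            # A does not list B as a neighbor
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
assert backlink_ok(lsdb, "B", "C")      # two-way: usable during SPF
assert not backlink_ok(lsdb, "A", "B")  # one-way only: D will not use it
```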
In January 1989 at the 12th Internet Engineering Task Force (IETF) meeting in Austin, Texas, Yakov Rekhter and Kirk Lougheed sat down at a table and in a short time a new exterior gateway routing protocol was born, the BGP. The initial BGP design was recorded on a napkin rumored to have been heavily spattered with ketchup. The design on the napkin was expanded to three handwritten sheets of paper from which the first interoperable BGP implementation was quickly developed.
BGP was originally designed to be an Exterior Gateway Protocol (EGP), which means it was intended to connect networks, or Autonomous Systems (ASes), rather than devices. If BGP is an EGP, this must mean that the other routing protocols, like RIP, EIGRP, OSPF, and IS-IS, must be Interior Gateway Protocols (IGPs)—a designation that “stuck.” Clearly defining interior and exterior gateways has proven useful in designing and operating large-scale networks. BGP is unique among the widely deployed protocols in its loop-free path calculation. There are three widely used distance vector protocols (Spanning Tree, RIP, and EIGRP). There are two widely used link state protocols (OSPF and IS-IS). And there are many more examples of these two types of protocols developed and deployed in what might be considered niche markets. BGP, however, is the only widely deployed path vector protocol.
What are the most important goals for an EGP? The first is obviously selecting loop-free paths, but this clearly does not mean the shortest path. The reason the shortest path is not as important in an EGP as it is in an IGP is that EGPs are used to connect entities, such as service providers, content providers, and corporate networks. Connecting networks at this level means focusing on policy, rather than efficiency—in complexity terms, increasing state through policy mechanisms while reducing overall network optimization in pure traffic-carrying terms.
BGP policy mechanisms will not be considered here in any depth; some basic policy concepts are considered in Chapter 17, “Control Plane Policy.” This section focuses on transport, peering, advertisement, and the BGP decision process.
BGP does not provide any sort of reliable transport. Instead, BGP relies on TCP to carry information between BGP peers. Using TCP ensures
• MTU detection is handled, even for connections crossing several hops (or routers).
• Flow control is taken care of by the underlying transport, so BGP does not need flow control directly (although most BGP implementations do interact with the TCP stack on the local host to improve throughput for BGP specifically).
• Two-way connectivity between peers is ensured by the three-way handshake implemented in TCP.
Even though BGP relies on an underlying TCP connection for many of the functions control planes must solve in building adjacencies, there are still a number of functions TCP cannot provide. Therefore, a fuller look at the BGP peering process is still in order; Figure 16-12 illustrates.
In Figure 16-12:
1. The BGP peering session begins in the idle state.
2. A sends a TCP open on port 179; B responds to an ephemeral port on A. After the TCP three-way handshake is completed (the TCP session is successful), BGP moves the peering state to connect. If the peering session is being formed across some type of state-based filtering, such as a firewall, it is important that the TCP open be transmitted from the “inside” of the filtering device.
3. If the TCP connection fails, the BGP peering state is moved to active.
4. A sends a BGP open to B and moves the session to the opensent state. At this point, A is waiting on B to send a keepalive. If B does not send a keepalive within a specific period, A will move the session back to the idle state. The open message contains a number of parameters, such as which address families the two BGP speakers support and the hold timer. This is called capabilities negotiation. The lower (minimum) of the two advertised hold timers is selected as the hold timer for the peering session.
5. When B sends A a keepalive, A moves B to the openconfirm state.
6. At this point, A will send B a keepalive to verify the connection. When A and B receive one another’s keepalives, the peering session will move to the established state.
7. The two BGP speakers exchange routes, so their tables are up to date. A and B only exchange their best paths, unless some form of BGP multipath is supported and configured on the two speakers.
8. To notify A it has finished sending its entire local table, B sends A an End of Table (EOT) or End of RIB (EOR) signal.
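One concrete piece of step 4, the hold-timer negotiation, can be sketched in a few lines; the one-third keepalive interval below is common implementation practice rather than something mandated by the steps above:

```python
# A sketch of hold-timer negotiation: each speaker proposes a hold timer
# in its OPEN message, and the minimum of the two proposals is used for
# the session. Keepalives are then typically sent at one-third of the
# negotiated hold time (a convention, not part of the state machine).

def negotiate_hold_timer(local_proposed, peer_proposed):
    hold = min(local_proposed, peer_proposed)
    keepalive = hold // 3
    return hold, keepalive

hold, keepalive = negotiate_hold_timer(180, 90)
# hold == 90, keepalive == 30: the session uses the lower of the two proposals
```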
There are two kinds of BGP peering relationships: BGP peers within the same Autonomous System (AS, which generally means the set of routers within a single administrative domain, though this is a rather loose definition) are called internal BGP (iBGP) peers, and BGP peers between autonomous systems are called external (or exterior) BGP (eBGP) peers. While the two kinds of BGP peering relationships are built the same way, they have different advertisement rules.
As BGP is designed to interconnect autonomous systems, the best path algorithm is focused primarily on policy, rather than loop free-ness. In fact, if you examine any standard explanation of the BGP best path process, whether or not a particular path is loop free is not included in the decision process at all. How, then, does BGP determine a particular peer is advertising a loop-free route? Figure 16-13 illustrates.
In Figure 16-13, each router is in a separate AS, so every pair of BGP speakers will form an eBGP peering session. A, which is connected to 2001:db8:3e8:100::/64, advertises this route toward B and C. BGP route advertisements carry a number of attributes, one of which is the AS Path (others will be discussed later in describing the best path selection process). Before A advertises 100::/64 to B, it adds its AS number into the AS Path attribute. B receives the route and advertises it to D; before advertising the route to D, it adds AS65001 to the AS Path. The AS Path, traced from A through to C, then looks something like this at each hop:
• As received by B: [AS65000]
• As received by D: [AS65000, AS65001]
• As received by C: [AS65000, AS65001, AS65003]
When D receives the route from B, it will advertise it back to C (there is no split horizon in BGP). Assume C, in turn, advertises the route back to A for some reason (it would not in this situation, because the path through A would be a better path to the destination, but this illustrates loop prevention); A will examine the AS Path and discover its local AS is in the AS Path. This is clearly a loop, so A simply ignores the route. Since this route is ignored, it is never placed in the BGP topology table; hence only loop-free routes are compared using the BGP best path process.
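The AS Path behavior just described can be sketched as follows; the AS numbers follow the discussion of Figure 16-13, and the list ordering matches the chapter's illustration rather than the on-the-wire encoding, which places the most recently added AS first:

```python
# A minimal sketch of eBGP loop prevention with the AS Path.

def advertise_ebgp(local_as, as_path):
    """Add our AS to the path before sending the route to an eBGP peer."""
    return as_path + [local_as]

def receive_ebgp(local_as, as_path):
    """Reject (return None) if our own AS already appears in the path;
    a rejected route never enters the BGP topology table."""
    if local_as in as_path:
        return None
    return as_path

path_at_b = advertise_ebgp(65000, [])            # A -> B: [65000]
path_at_d = advertise_ebgp(65001, path_at_b)     # B -> D: [65000, 65001]
path_at_c = advertise_ebgp(65003, path_at_d)     # D -> C: [65000, 65001, 65003]
assert receive_ebgp(65000, path_at_c) is None    # back at A: loop, ignored
```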
The BGP best path process consists of 13 steps in most implementations (the first step is not always implemented, as it is a local decision on the part of the BGP speaker):
1. The route with the highest weight is chosen. Some implementations do not implement a route weight.
2. The route with the highest local preference (LOCAL PREF) is chosen. The local preference represents the exit policy of the local AS—which exit point, out of the available exit points, the owner of this AS would like the BGP speaker to prefer.
3. Prefer the locally originated route—that is, a route originated on this BGP speaker. This step is rarely used in the decision process.
4. Prefer the path with the shortest AS Path. This step is intended to prefer the most efficient path through the internetwork, by choosing the path that will pass through the smallest number of autonomous systems to reach the destination. Operators often prepend AS Path entries to influence this step in the decision process.
5. Prefer the path with the lowest origin type. Routes that are redistributed from an IGP are preferred over routes with an unknown origin. This step rarely has any impact on the decision process.
6. Prefer the path with the lowest multiexit discriminator (MED). The MED represents the entrance policy of the remote AS. As such, the MED is only compared if multiple routes have been received from the same neighboring AS; if the same route is received from two different neighboring autonomous systems, the MED is ignored.
7. Prefer eBGP routes over iBGP routes.
8. Prefer the route with the lowest IGP cost to the next hop. If no local exit policy is set (in the form of the local preference), and the neighboring AS has not set an entrance policy (in the form of the MED), then the path with the closest exit from the local router is chosen as the exit point.
9. Determine if multiple paths should be installed in the routing table (if some form of multipath is configured).
10. If comparing two external routes (learned from an eBGP peer), prefer the oldest route, or the route learned first. This rule prevents route churn just because routes are refreshed.
11. Prefer the route learned from the peer with the lowest router ID. This is simply a tiebreaker to prevent churn in the routing table.
12. Prefer the route with the shortest cluster length (see the next section for an explanation of the cluster).
13. Prefer the route learned from the peer with the lowest peering address. This is, again, simply a tiebreaker, chosen arbitrarily to prevent ties from causing churn in the routing table, and would normally be used when two BGP peers are connected over two parallel links.
While this seems like a long process, almost every best path decision in BGP comes down to four factors: the local preference, the MED, the AS Path length, and the IGP cost.
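As a rough illustration, those four factors can be expressed as a sort key; the route dictionaries and attribute names below are invented, and all of the remaining steps (weight, origin, eBGP versus iBGP, and the later tiebreakers) are omitted:

```python
# A compressed sketch of the four dominant best-path factors, in decision
# order. Python compares tuples element by element, so earlier factors
# only fall through to later ones on a tie.

def best_path_key(route):
    return (
        -route["local_pref"],        # step 2: highest local preference wins
        len(route["as_path"]),       # step 4: shortest AS Path wins
        route["med"],                # step 6: lowest MED wins
        route["igp_cost"],           # step 8: lowest IGP cost to next hop wins
    )

routes = [
    {"peer": "X", "local_pref": 100, "as_path": [65001, 65000], "med": 0, "igp_cost": 20},
    {"peer": "Y", "local_pref": 100, "as_path": [65002, 65000], "med": 0, "igp_cost": 10},
]
best = min(routes, key=best_path_key)
# same local preference, path length, and MED, so the lower IGP cost (Y) wins
```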
Note
If this process isn’t complex enough, BGP has been extended to support almost any best path decision scheme an operator can think of. See BGP Custom Decision Process for more information.4 These custom decision capabilities can determine which path is the best path before, or after, any of the decision points described here.
BGP has two simple rules to determine where to advertise a route:
• Advertise the best path to every destination to every eBGP peer.
• Advertise the best path learned from an eBGP peer to every iBGP peer.
Another way to put these two rules is this: never advertise a route learned from an iBGP peer to another iBGP peer. Figure 16-14 illustrates.
In Figure 16-14, A and B are eBGP peers, while B and C, and C and D, are iBGP peers. Assume A advertises 2001:db8:3e8:100::/64 to B. Since B received this route advertisement from an eBGP peer, it will advertise 100::/64 to C, which is an iBGP peer. C, on learning this route, will not advertise the route to D, however, as C received the route from an iBGP peer, and D is also an iBGP peer. In this illustration, then, D will not learn about 100::/64. This does not seem very useful in the real world; however, the restriction is there for a reason.
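The two advertisement rules can be restated as a tiny predicate, under the simplifying assumption that a best path is either eBGP-learned or iBGP-learned (locally originated routes, route reflectors, and confederations are ignored in this sketch):

```python
# The BGP advertisement rules as a predicate over how the best path was
# learned and the kind of peer it would be sent to.

def should_advertise(learned_from, peer_kind):
    # the single forbidden combination: an iBGP-learned route to an iBGP peer
    return not (learned_from == "ibgp" and peer_kind == "ibgp")

assert should_advertise("ebgp", "ibgp")      # B advertises A's route to C
assert not should_advertise("ibgp", "ibgp")  # C does not pass it on to D
```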
Consider how BGP prevents routing loops from forming—by carrying, in the route advertisement itself, a list of the autonomous systems through which the route has passed. When a route is advertised from one iBGP speaker to another, there is no change in the AS Path, so if iBGP speakers advertised routes learned from iBGP peers to other iBGP peers, routing loops could easily form. One solution to this problem is simply to build a multihop peering relationship between B and D (remember that BGP runs on top of TCP; so long as there is IP connectivity between two BGP speakers, they can build a peering relationship). Assume B builds a peering relationship with D across C, and neither B nor D builds a peering relationship with C. What happens when D switches traffic toward 100::/64 through C? C will not have a route to 100::/64, so it will drop the traffic. This can be solved in a number of ways—for instance, B and D could tunnel the traffic across C, so C does not need reachability to the external destination. BGP could also be configured to redistribute routes into whatever underlying IGP is running (this is a bad idea—do not do it).
BGP route reflectors were standardized to resolve this problem. Figure 16-15 illustrates the operation of route reflectors.
In Figure 16-15, E is configured as a route reflector; B, C, and D are configured as route reflector clients (specifically, as clients of E). A advertises the 2001:db8:3e8:100::/64 route to B; B advertises this route to E, because it was received from an eBGP peer, and E is an iBGP peer. E adds a new attribute to the route, a cluster list, which indicates the path of the update within the AS through the route reflector clusters. E will then advertise the route to each of its clients. Loop prevention, in this case, is handled by the cluster list.
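The cluster-list check can be sketched in the same style as the AS Path check; the cluster IDs below are invented labels standing in for real cluster identifiers:

```python
# A sketch of the cluster-list loop check: a route reflector adds its
# cluster ID when reflecting a route, and discards any update whose
# cluster list already carries that ID.

def reflect(cluster_id, cluster_list):
    if cluster_id in cluster_list:
        return None                    # the update already crossed this cluster
    return [cluster_id] + cluster_list

assert reflect("E", []) == ["E"]       # E reflects B's update to C and D
assert reflect("E", ["E"]) is None     # a reflected copy returning to E is dropped
```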
While BGP was originally designed to interconnect autonomous systems, its use has spread to data center fabrics, network cores, and carrying information about virtual private networks. The uses to which BGP has been put are, in fact, almost limitless; hence, you will encounter BGP in a number of future chapters. Along the way, BGP has become a very complex protocol; this section barely begins to sketch the operation of the protocol.
BGP can be described as
• A proactive protocol that learns about reachable destinations through configuration, local information, and other protocols
• A path vector protocol that advertises only the best path to each neighbor and does not prevent loops within an autonomous system (unless route reflectors or some additional feature is deployed)
• Selecting loop-free paths by examining the path through which the destination can be reached
• Validating two-way connectivity and MTU through its use of TCP as a transport
It is only possible to scratch the surface of distributed control planes in two short chapters. Hopefully, however, these chapters give you a sense of how complex the problem of calculating loop-free paths really is and how many possible solutions to this problem set there are. So long as you remember the basic classifications, however, you can quickly grasp the basic operation of any routing protocol:
• How does it learn about and advertise information about topology and reachable destinations? Is the protocol reactive or proactive?
• How do devices running the protocol discover other devices running the same protocol? How does it form neighbors?
• How does the protocol detect MTU mismatches?
• How does the protocol distribute routing information reliably through the network?
• How does the protocol marshal data?
• How does the protocol remove topology and reachability information?
• How does the protocol ensure two-way connectivity, both at the neighbor level and when calculating loop-free paths?
• How does the protocol calculate loop-free paths?
You should consider the resources in the “Further Reading” section if you would like to understand each or any of these protocols in greater depth.
Chandra, Ravi, and John Scudder. Capabilities Advertisement with BGP-4. Request for Comments 5492. RFC Editor, 2009. https://rfc-editor.org/rfc/rfc5492.txt.
Chen, Enke, Tony J. Bates, and Ravi Chandra. BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP). Request for Comments 4456. RFC Editor, 2006. https://rfc-editor.org/rfc/rfc4456.txt.
Chen, Enke, John Scudder, Alvaro Retana, and Daniel Walton. Advertisement of Multiple Paths in BGP. Request for Comments 7911. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7911.txt.
Chen, Enke, and Quaizar Vohra. BGP Support for Four-octet AS Number Space. Request for Comments 4893. RFC Editor, 2007. https://rfc-editor.org/rfc/rfc4893.txt.
Chunduri, Uma, Wenhu Lu, Albert Tian, and Naiming Shen. IS-IS Extended Sequence Number TLV. Request for Comments 7602. RFC Editor, 2015. https://rfc-editor.org/rfc/rfc7602.txt.
Doyle, Jeff, and Jennifer DeHaven Carroll. Routing TCP/IP, Volume 1. 2nd edition. Indianapolis, IN: Cisco Press, 2005.
Ferguson, Dennis, Acee Lindem, and John Moy. OSPF for IPv6. Request for Comments 5340. RFC Editor, 2008. https://rfc-editor.org/rfc/rfc5340.txt.
Ginsberg, Les, Stephane Litkowski, and Stefano Previdi. IS-IS Route Preference for Extended IP and IPv6 Reachability. Request for Comments 7775. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7775.txt.
Heitz, Jakob, Keyur Patel, Job Snijders, Ignas Bagdonas, and Nick Hilliard. “BGP Large Communities.” Internet-Draft. Internet Engineering Task Force, January 2017. https://tools.ietf.org/html/draft-ietf-idr-large-community-12.
“Intermediate System to Intermediate System Intra-Domain Routing Information Exchange Protocol for Use in Conjunction with the Protocol for Providing the Connectionless-Mode Network Service.” Standard. Geneva, CH: International Organization for Standardization, 2002. http://standards.iso.org/ittf/PubliclyAvailableStandards/.
Katz, Dave. “OSPF and IS-IS: A Comparative Anatomy.” Presented at the NANOG19, Albuquerque, NM, June 12, 2000. https://nanog.org/meetings/abstract?id=1084.
McPherson, Danny R., and Keyur Patel. Experience with the BGP-4 Protocol. Request for Comments 4277. RFC Editor, 2006. https://rfc-editor.org/rfc/rfc4277.txt.
Meyer, David, and Keyur Patel. BGP-4 Protocol Analysis. Request for Comments 4274. RFC Editor, 2006. https://rfc-editor.org/rfc/rfc4274.txt.
Mirtorabi, Sina, Abhay Roy, Acee Lindem, and Fred Baker. “OSPFv3 LSA Extendibility.” Internet-Draft. Internet Engineering Task Force, October 2016. https://tools.ietf.org/html/draft-ietf-ospf-ospfv3-lsa-extend-13.
Moy, John T. OSPF Version 2. Request for Comments 2328. RFC Editor, 1998. https://rfc-editor.org/rfc/rfc2328.txt.
Parker, Jeff. Recommendations for Interoperable Networks Using Intermediate System to Intermediate System (IS-IS). Request for Comments 3719. RFC Editor, 2004. https://rfc-editor.org/rfc/rfc3719.txt.
Przygienda, Dr. Antoni B. Optional Checksums in Intermediate System to Intermediate System (ISIS). Request for Comments 3358. RFC Editor, 2002. https://rfc-editor.org/rfc/rfc3358.txt.
Ramachandra, Srihari S., and Yakov Rekhter. BGP Extended Communities Attribute. Request for Comments 4360. RFC Editor, 2006. https://rfc-editor.org/rfc/rfc4360.txt.
Raszuk, Robert, Christian Cassar, Bruno Decraene, Stephane Litkowski, Kevin Wang, and Erik Aman. “BGP Optimal Route Reflection (BGP-ORR).” Internet-Draft. Internet Engineering Task Force, January 2017. https://tools.ietf.org/html/draft-ietf-idr-bgp-optimal-route-reflection-13.
Rekhter, Yakov, Susan Hares, and Tony Li. A Border Gateway Protocol 4 (BGP-4). Request for Comments 4271. RFC Editor, 2006. https://rfc-editor.org/rfc/rfc4271.txt.
Retana, Alvaro, and Russ White. “BGP Custom Decision Process.” Internet-Draft. Internet Engineering Task Force, February 2017. https://tools.ietf.org/html/draft-ietf-idr-custom-decision-08.
Roy, Abhay, Yi Yang, and Alvaro Retana. Hiding Transit-Only Networks in OSPF. Request for Comments 6860. RFC Editor, 2013. https://rfc-editor.org/rfc/rfc6860.txt.
Shand, Mike, Stefano Previdi, Les Ginsberg, and Danny R. McPherson. Simplified Extension of Link State PDU (LSP) Space for IS-IS. Request for Comments 5311. RFC Editor, 2009. https://rfc-editor.org/rfc/rfc5311.txt.
Vohra, Quaizar, and Enke Chen. BGP Support for Four-Octet Autonomous System (AS) Number Space. Request for Comments 6793. RFC Editor, 2012. https://rfc-editor.org/rfc/rfc6793.txt.
Walton, Daniel, Alvaro Retana, Enke Chen, and John Scudder. Solutions for BGP Persistent Route Oscillation. Request for Comments 7964. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7964.txt.
Wang, Lili, Zhaohui (Jeffrey) Zhang, and Nischal Sheth. OSPF Hybrid Broadcast and Point-to-Multipoint Interface Type. Request for Comments 6845. RFC Editor, 2013. https://rfc-editor.org/rfc/rfc6845.txt.
White, Russ. Intermediate System to Intermediate System (IS-IS) Routing Protocol LiveLessons. Video. LiveLessons. Cisco Press, 2016. http://www.ciscopress.com/store/intermediate-system-to-intermediate-system-is-is-routing-9780134465326?link=text&cmpid=2017_02_02_CP_RussWhiteVideo.
White, Russ, Danny McPherson, and Srihari Sangli. Practical BGP. Boston, MA: Addison-Wesley Professional, 2004.
White, Russ, and Alvaro Retana. IS-IS: Deployment in IP Networks. 1st edition. Boston, MA: Addison-Wesley, 2003.
1. Why does IS-IS send interface addresses as “neighbors seen” on multiaccess links like Ethernet, and IS identifiers on point-to-point links? What is the reasoning behind the different forms of two-way connectivity checks?
2. IS-IS carries two kinds of metrics—narrow and wide. Describe the mechanism used to transition between these two metric types. Is it effective? How does it compare to the solution adopted by EIGRP? Does it suffer from the same sorts of failure modes as the EIGRP transition mechanism?
3. It is possible that an IS-IS LSP might become longer than the maximum size allowed based on the “size of LSP” field in the LSP header. Describe how RFC5311 solves this problem. Are there any other ways you can think of to solve this same problem?
4. IS-IS and OSPF rely on sequence numbers to indicate which piece of information being flooded through the network is the most recent. Read RFC7602 and RFC5310. Describe the problem caused by this reliance and how IS-IS resolved this problem. Are there problems with the solution standardized in RFC7602?
5. Compare OSPF and IS-IS data marshalling using the complexity model described earlier in the book (state/optimization/surface). Where do you think these two protocols have traded off state for optimization? Do the multiple LSA types and reliance on IP fragmentation represent an interaction surface that increases the complexity of OSPF?
6. Describe the security issue created by the link state age-out and reflood behavior of both OSPF and IS-IS. Find and describe the solution proposed in the IETF.
7. Consider DIS/DR election on a point-to-point link that is considered a broadcast medium (such as a point-to-point Ethernet link). Will electing a DR/DIS and creating a pseudonode reduce overall complexity or increase it? What features have been implemented in commercial implementations of OSPF and IS-IS to mitigate the result?
1. Katz, “OSPF and IS-IS: A Comparative Anatomy.”
2. White and Retana, IS-IS: Deployment in IP Networks.
3. Mirtorabi et al., “OSPFv3 LSA Extendibility.”
4. Retana and White, “BGP Custom Decision Process.”
The last several chapters have considered the many variations on finding a set of loop-free paths through a network. In the explanation of the Border Gateway Protocol (BGP), however, you might have noticed the emphasis on various policies, rather than strictly finding loop-free paths. This chapter, then, will continue the emphasis on policy begun in the preceding chapter.
The first question to answer is: what is policy? Unfortunately, there is no simple answer. The best way to answer this question is through examples; these will be considered in the following section. The second section of this chapter will draw lessons from these examples, and then consider problems and solutions in the control plane policy space.
Control policy is often difficult to separate conceptually from data plane policy, such as packet filtering and Quality of Service (QoS). In fact, the two overlap in many places, such as the control plane carrying QoS markings that are then applied to packets, or drawing packets into a null interface, effectively dropping them. These sorts of corner cases are avoided here for clarity.
Often the best way to understand a concept is through examples. This section examines three examples of policy being used in the control plane to fulfill business requirements: determining where traffic should exit a provider network, optimizing application performance by pinning elephant flows, and increasing or providing security through network segmentation. The next section draws a set of lessons from these examples.
Service providers normally live within a world of tight budgets, application requirements, and business drivers. The mixture of these three can make for some strange situations when routing between providers of various kinds. Specifically, cold potato routing is designed to keep traffic inside the provider’s network for as long as possible, while hot potato routing is designed to push traffic to the closest exit point possible. The result of mixing these two is sometimes called (tongue in cheek) mashed potato routing. Figure 17-1 is used to explain.
Assume AS65000 is an edge provider, or perhaps an “enterprise” network connected to two upstream providers, AS65001 and AS65003. AS65001, AS65003, and AS65004 are transit providers, and AS65002 is a content provider. Some of the policies and business drivers for those policies in this collection of networks might be
• AS65001 wants to draw as much traffic from AS65000, across link C, as possible. The more this link is filled up, the more likely AS65000 is to purchase an upgraded link. There is actually little AS65001 can do to attract traffic, of course, other than perhaps trying to convince the administrators in AS65000 to ship more traffic their direction, or trying to improve the performance of the link from the perspective of some sort of traffic engineering system AS65000 might have configured on their end of the link.
• AS65001 wants to forward any traffic entering at link C to the closest exit with a route to the destination. For instance, if AS65003 is advertising a route to K, on the right side of the diagram, AS65001 will prefer the exit through link D, even though it might not be the shortest overall path to the destination. Normally, AS65001 would implement this sort of policy using BGP’s local preference, or by relying on the underlying Interior Gateway Protocol (IGP) metric to draw traffic to the closest exit point out of the Autonomous System (AS). This is called hot potato routing. Why does AS65001 want to push the traffic to the nearest exit point with a route to the destination? Because carrying the traffic along the path to link H, for instance, consumes network resources. AS65001 is being paid based on the usage of link C, rather than for actually carrying the traffic as close as possible to the destination. Hence, AS65001 will draw as much traffic as possible off its paying customers but then push the traffic to the nearest exit point.
• AS65002, on the other hand, generally wants to control its user’s experience as tightly as possible, because it is selling a service. If the network between the service and the user has poor quality, then the service itself is perceived to be poor quality, and the content provider’s overall business will suffer. The longer the traffic stays in AS65002’s network, the more control the content provider has over the Quality of Service delivery. Keeping the customer’s traffic inside the network is essentially bringing the customer’s eyeballs closer to the service itself. This is a form of cold potato routing. Instead of tossing the traffic out of your network as quickly as possible (as you would with a hot potato), you hold on to the traffic as long as possible (like a cold potato). In this case, AS65002 is going to perceive the closest exit point to the customer as being through AS65001 at link H because the path through H has the shortest AS path. Although the internal path is longer, AS65002 will choose the path through link H to control the traffic as long as possible.
• When traffic is received at link H, AS65001 needs to decide whether to send the traffic to some nearby exit point, say link F, or to carry the traffic along the entire network so it exits at link C. In this case, AS65001 will almost always decide to carry the traffic along its entire network. Again, the primary selling point AS65001 has toward AS65000 is to increase the average utilization along link C. To do this, AS65001 needs traffic to send toward AS65000; the only way to get this traffic is to carry traffic from every entry point into the network to the connection to the customer, if the destination is a customer. This is again cold potato routing.
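The difference between the hot and cold potato policies above can be sketched in a few lines of Python. The link names echo the discussion, but the cost and AS-path values are purely illustrative:

```python
# Hypothetical sketch of hot- versus cold-potato exit selection.
# Each candidate exit carries the IGP cost to reach it and the
# AS-path length of the route learned there; values are invented.

def hot_potato_exit(candidates):
    """Hot potato: hand traffic to the nearest exit (lowest IGP cost)."""
    return min(candidates, key=lambda c: c["igp_cost"])

def cold_potato_exit(candidates):
    """Cold potato: carry traffic internally toward the exit whose
    route has the shortest AS path, regardless of internal cost."""
    return min(candidates, key=lambda c: c["as_path_len"])

exits = [
    {"name": "link D", "igp_cost": 10, "as_path_len": 3},
    {"name": "link H", "igp_cost": 40, "as_path_len": 2},
]

print(hot_potato_exit(exits)["name"])   # the nearest exit wins
print(cold_potato_exit(exits)["name"])  # the shortest AS path wins
```

In a real BGP implementation, the cold potato preference would be expressed through attributes such as local preference rather than a direct comparison like this.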
Forwarding traffic over the fewest number of links (and therefore through the fewest number of network devices) would consume the smallest amount of resources, but cold and hot potato routing both choose some longer-length path in order to satisfy a policy constraint. This trades the efficient use of resources for the efficient operation of a protocol or service in order to increase revenue. Other policies may be applicable to routing systems, as well, such as choosing the path with the highest bandwidth, or the path that takes traffic to the geographic exit point closest to the user. Whatever the policy, it will generally represent a tradeoff between one kind of optimization over some other kind of optimization, and require additional state of some sort to implement.
Many times networks are logically divided to control access to specific resources. The network shown in Figure 17-2 will be used to illustrate.
Figure 17-2 shows three different networks:
• Network A shows the base (routed) topology.
• Network B shows one set of devices and links that must be connected to one another.
• Network C shows a second set of devices and links that must be connected to one another.
In Figure 17-2, host B must only be able to connect to server L, and host A must only be able to connect to H. It is simple enough to provide this kind of segmentation through simple packet filters configured at G and K, of course, but further requirements may rule out using simple packet filters. For instance:
• There may be a requirement for traffic passing between A and H to use the path [C,E,F,K,G]; this mixes a traffic engineering requirement with a service access requirement.
• There may be a requirement for servers H and L to not even be able to see routes and other information from the other topology. This might be very difficult if the two servers are participating in the routed control plane, as might be the case if they are hosting many virtual machines (VMs), each of which needs to advertise its own IP address into the control plane.
When you reach this level of requirement, a common solution is to create an overlay network, often using tunnels to do the heavy lifting of separating the network into several virtual topologies. In Figure 17-2, these requirements are met with an overlay. In network B, a tunnel would be built starting at the inbound interface of D facing host B (the tunnel headend). This tunnel would be carried across G and K, finally terminating on the interface on K that connects to L (the tunnel tailend). To draw traffic from B to L, there must be some routed control plane to pull traffic into the tunnel headend so it is routed across the tunnel toward L. In network C, a tunnel would be built starting at the inbound interface of C, facing A. The tunnel is carried across C, E, F, K, and G, and terminates at the outbound interface at G facing H. Again, there must be some control plane to draw data across this tunnel, so traffic sourced from A is pulled into the tunnel at C and is presented to G as a “raw IP packet” (without the tunnel headers) so that G can switch the packet to H.
The routing information that draws traffic through these two tunnels may actually be carried in a separate control plane. In this case, the underlay control plane will provide reachability to the tunnel endpoints, while the overlay control plane will draw traffic through the tunnel. This separation of control planes allows the different topologies, the underlay and the overlay, to be completely separated; reachability and topology information is not shared between these two control planes.
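This separation can be sketched as two independent lookup tables, where forwarding resolves through the overlay first and the underlay second; the addresses and interface name below are invented for illustration:

```python
# Illustrative sketch (not a real implementation) of separated
# overlay and underlay control planes: the overlay resolves the
# destination to a tunnel endpoint, and the underlay resolves the
# tunnel endpoint to a physical next hop. Addresses are made up.

underlay = {"10.0.0.2": "eth0"}           # tunnel tailend -> interface
overlay = {"192.168.1.0/24": "10.0.0.2"}  # prefix -> tunnel tailend

def forward(prefix):
    tailend = overlay[prefix]  # overlay lookup: which tunnel carries this?
    return underlay[tailend]   # underlay lookup: how is the tailend reached?

print(forward("192.168.1.0/24"))  # -> eth0
```

Note that neither table contains the other's information: the overlay knows nothing about the physical topology, and the underlay knows nothing about the prefixes carried in the overlay.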
The same is true of the traffic being drawn through the network; the two flows are separated by being tunneled. Not only is the traffic being separated by tunneling, but the path of the flow is also being engineered through the network. Tunneling, and the fuller concept of an overlay, is useful in meeting a lot of different policy requirements; this is why overlays are so widely used in network engineering.
Elephant flows and mouse flows are two classes of flows that engineers often encounter. An elephant flow is typically a large, persistent data flow. Any flow taking up more than around 20% of the available bandwidth of a single link and persisting for more than two or three minutes might, for instance, be classified as an elephant flow. Mouse flows, on the other hand, are much lower bandwidth, say less than 1% of the available bandwidth on any link, and tend to last for very short periods of time. Most flows of traffic can be divided into elephant and mouse flows. What should be done about elephant and mouse flows?
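The thresholds mentioned above can be expressed as a simple classifier; the cutoff values are the illustrative ones from the text, not a standard:

```python
# Sketch of the flow classification thresholds described above. The
# cutoffs (20% of link bandwidth for more than roughly three minutes
# for an elephant, under 1% for a mouse) are illustrative values.

def classify_flow(bandwidth_fraction, duration_seconds):
    if bandwidth_fraction > 0.20 and duration_seconds > 180:
        return "elephant"
    if bandwidth_fraction < 0.01:
        return "mouse"
    return "unclassified"

print(classify_flow(0.25, 3600))  # elephant
print(classify_flow(0.005, 2))    # mouse
```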
One solution is to interleave the packets from each flow, which allows each flow fair access to the available bandwidth; this is essentially a Quality of Service (QoS) approach. Another solution is to pin particular traffic flows to particular paths, called path pinning. Figure 17-3 is used to explain further.
In Figure 17-3, A begins a flow that will last several hours, and consumes 20% of the available bandwidth on a link (assume all links are the same bandwidth, 100Mbps), and terminates at H. At about the same moment, B sends a series of short-term small flows terminating at H. Given just one path will be chosen as the best between C and G, both flows will follow the same path, say along the path [C,E,G]. Mixing the two flows in this way can cause both to suffer from a performance perspective. To understand the problem, it is best to consider the rate at which packets can be serialized onto the wire:
• 64-byte packet onto a 100Mbps link: .005ms
• 1,500-byte packet onto a 100Mbps link: .12ms
• 9,000-byte packet onto a 100Mbps link: .72ms
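These serialization times follow directly from dividing the packet size in bits by the link speed; the sketch below reproduces the 1,500- and 9,000-byte values:

```python
# Serialization delay is the packet size in bits divided by the link
# rate; multiplying by 1000 converts seconds to milliseconds.

def serialization_delay_ms(packet_bytes, link_bps):
    return packet_bytes * 8 / link_bps * 1000

print(round(serialization_delay_ms(1500, 100e6), 2))  # 0.12
print(round(serialization_delay_ms(9000, 100e6), 2))  # 0.72
```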
Assume the entire network is capable of 9,000-byte packet sizes (the Maximum Transmission Unit, or MTU is 9,000 bytes end to end), and the elephant flow is actually shipping 9,000-byte packets. For the mouse flow, assume the packet size is 64-byte packets (at least in one direction). If a single mouse flow packet is trapped behind a single elephant flow packet, the mouse flow packet will be held for .72ms before it can be serialized onto the physical interface. If there is always one packet from each flow alternating, there can be some significant performance reduction, but both applications would likely still work well enough.
But what happens if the interleaving between the two flows is less than optimal? For instance, what if there is a series something like the sequence of packets shown in Figure 17-4?
The shortest and longest spacing between mouse flow packets are .05ms and .288ms; the shortest and longest spacing between elephant flow packets are .05ms and .15ms. These variations might seem minimal, but even minimal variations show up as jitter end to end. This kind of jitter, particularly on a larger scale, is problematic for flow control and error correction. In this case, even though the elephant flow is overwhelmingly larger than the mouse flow, both are still negatively impacted. This same sort of problem is common in data center fabrics, as well. Figure 17-5 illustrates.
In Figure 17-5, A has two flows: an elephant flow to G and a set of mouse flows to H. While there is plenty of bandwidth to support both flows across the fabric, if both flows happen to be hashed onto the [B,C] link by the equal cost multipath (ECMP) algorithm, the interaction of the two flows can cause jitter for the supported applications, reducing performance.
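The effect can be sketched as follows; the hash function (CRC32) and the addresses are illustrative stand-ins for whatever a real ECMP implementation uses:

```python
# Sketch of why ECMP can place both flows on the same link: the five
# tuple is hashed, and the hash modulo the number of equal-cost links
# selects the link. The hash function and addresses are illustrative.

import zlib

def ecmp_link(five_tuple, link_count):
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % link_count

elephant = ("10.1.1.1", "10.2.2.2", 6, 49152, 443)
mouse = ("10.1.1.3", "10.2.2.4", 6, 50000, 80)

# Nothing prevents both hashes from selecting the same of two links;
# the choice depends only on the tuples and the hash function.
print(ecmp_link(elephant, 2), ecmp_link(mouse, 2))
```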
Pinning the elephant flow to the [B,C] link and keeping other traffic off this link so that the traffic to F follows the [A,B,D,F,H] path can resolve these performance problems. Elephant flows tend to be more common in the data center environment.
How can the problems caused by mixing two different kinds of traffic on a single link be prevented? One obvious way in a two-connected network, such as the ones illustrated in Figure 17-3 and Figure 17-5, is to somehow pin one of the flows onto one path and remove the other flows from the link the elephant flow is pinned to. For instance:
• In Figure 17-3, if one of the two flows is pinned to the longer [C,D,F,G] link, while leaving the other flow on the shorter [C,E,G] link.
• In Figure 17-5, if the elephant flow is pinned to one path, say [B,C,E], and the mouse flows can be somehow directed to avoid [B,C,E] so they use some other path, say [B,D,F].
Not only must the elephant flow be pinned to a particular path, but the mouse flows must be prevented from flowing along the path the elephant flow has been pinned to. Sometimes just allowing one flow to follow the shortest loop-free path while pinning the other flow to some longer (but still loop-free) path will be sufficient. This does not, however, often work in data center fabrics and other networks where the available paths across which the traffic must be engineered are equal cost. Pinning the elephant flow to one path is not useful if the mouse flows can still be placed on the same path as the elephant flow through the operation of a router randomly choosing among a set of equal cost paths.
To separate the two flows in the example in Figure 17-3, there must be some way to differentiate the flows during the switching process. There are a number of ways to differentiate the flows, including
• The destination address. In Figure 17-3, both flows are destined to H, so the destination address would not be useful for differentiating between the two flows. This is not always the case.
• The source address. In Figure 17-3, the source of the first flow is A, and the source of the second is B. The source address could be used in this situation to differentiate between the flows at C. However, because hosts normally send packets (or open sessions) with a number of servers in a network, the source and destination addresses are normally used together, rather than just the source address.
• The port number. Port numbers and protocol numbers are normally associated with a single application on a host or a server. The source and destination port numbers can often be combined to pick out traffic from a specific flow, rather than one or the other.
These differentiators can be combined into a set of markers uniquely identifying every flow in a network running the Internet Protocol (IP) suite, the five tuple. The five tuple consists of
• The source IP address
• The destination IP address
• The protocol number
• The source port number
• The destination port number
Because of the way each protocol operates, either the source or the destination port will be an ephemeral port, a short-lived port number assigned by the host for that specific connection. There are alternative ways for traffic to be identified other than examining the various fields that identify a flow. For instance, host A could be configured to mark all the packets in an elephant flow with a particular number in an IPv6 extension header, or with specific QoS bits. In this case, C could simply check for the specified information in the IP header, determining which link traffic should be switched to based on the contents of the correct field.
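A minimal sketch of constructing this key follows; the packet field names are hypothetical rather than taken from any particular packet-processing library:

```python
# Minimal sketch of building the five tuple used to identify a flow;
# the packet is modeled as a plain dictionary with invented keys.

def five_tuple(packet):
    return (packet["src_ip"], packet["dst_ip"], packet["protocol"],
            packet["src_port"], packet["dst_port"])

pkt = {"src_ip": "10.1.1.1", "dst_ip": "10.2.2.2",
       "protocol": 6, "src_port": 49152, "dst_port": 443}

# Every packet of the same flow yields the same key, so the switching
# path can differentiate flows without any additional marking.
print(five_tuple(pkt))
```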
Given the traffic flows can be differentiated from one another, what techniques are available to draw (or push) the traffic in each flow along a different link? Several methods are often used.
A statically configured packet filter, sometimes called a policy route, or a filter-based forwarding rule, can be configured at C and G in Figure 17-3. This rule would contain logic that matches on the fields differentiating the flows and sets the correct next hop. This solution (obviously) requires manual configuration; this configuration must be managed over time, including adjusting where the packet filter is applied, what traffic is matched by the filter, and where the matching traffic is forwarded. This kind of filter can seem simple when first deployed, but can become difficult to maintain over time. For instance, in Figure 17-3, examining the traffic pattern at F would give you no clues about why one of the flows was traveling over the longer path. You would need to trace the traffic back to C to discover why this traffic is passing along this particular path. Because of the additional management and maintenance issues, automated solutions are often preferred.
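Such a filter-based forwarding rule can be sketched as an ordered list of matches consulted before the routing table; the addresses and next hops below are hypothetical:

```python
# Sketch of filter-based forwarding (policy routing): an ordered rule
# list matches on five-tuple fields and overrides the next hop, with
# a fall-through to the normal routing table. All names are invented.

rules = [
    # (match fields, next hop): pin one flow onto the longer path via D
    ({"src_ip": "10.1.1.1", "dst_ip": "10.2.2.2"}, "D"),
]

def next_hop(packet, routing_table_hop):
    for match, hop in rules:
        if all(packet.get(k) == v for k, v in match.items()):
            return hop
    return routing_table_hop  # no rule matched; normal forwarding

pkt = {"src_ip": "10.1.1.1", "dst_ip": "10.2.2.2"}
print(next_hop(pkt, "E"))                      # policy route wins: D
print(next_hop({"src_ip": "10.1.1.9"}, "E"))   # falls through: E
```

The maintenance problem described above is visible even in this toy version: nothing in the forwarding result itself records why a given packet took the longer path.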
Metric manipulation can sometimes be used to draw each flow along a different path. In Figure 17-3, if H is sending traffic to A and B, the path through [C,D,F,G] could be manipulated to have a lower cost toward A, while the path through [C,E,G] could be manipulated to have a lower cost toward B. One problem with this solution is obvious from the example. Consider the situation from the perspective of A and B; these two hosts are, in fact, sending both the elephant and the mouse flows to the same destination, so there is no way to use metrics to draw traffic from these two hosts along the two available paths. A second problem with this solution is similar to the one described previously with a packet filter. If you examine the routing table at F, there would be no obvious reason why the metrics for the two different destinations are different. Again, you would need to trace back the difference in metrics to some configuration on either C or G to discover why two destinations that appear to be on either end of the same set of links have two different metrics.
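The first problem can be shown in a short sketch: because the best path is selected per destination, every flow toward the same destination follows it, regardless of source. The costs here are illustrative:

```python
# Sketch of why metric manipulation cannot separate two flows bound
# for the same destination: path selection is keyed on the destination
# alone, so the lowest-cost path wins for every flow toward H.

paths_to_H = {("C", "E", "G"): 20, ("C", "D", "F", "G"): 30}

def best_path(paths):
    return min(paths, key=paths.get)

# Both the elephant flow from A and the mouse flows from B are
# destined to H, so both are switched along the same path.
print(best_path(paths_to_H))
```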
The packets in one flow may be tunneled through the network, or drawn into a virtual overlay topology. A tunnel would more likely be used to solve an elephant flow problem than a virtual overlay topology in most networks; a tunnel can be directed along a single path, but a virtual overlay topology is likely to have many paths that traffic could take, so the flow isn’t pinned to a specific path.
The three use cases (or examples) given in the previous section are examples of traffic engineering, which simply means manipulating the control plane to specify the path that specific flows take through the network. The first point to observe in these examples is for every case, some traffic is removed from the shortest—and hence presumably the most efficient—path through the network and somehow made to follow a longer loop-free path. This common element is helpful in defining policy:
Control plane policy is anything that causes traffic to flow over a path longer than the shortest path in order to provide some form of optimization.
One specific term needs to be considered further in this definition: what, precisely, does optimization mean? While there are many possible optimizations, they can be broken down into four broad categories:
• Network utilization: Operators sometimes try to optimize the utilization of a single link, such as a highly utilized path between two data centers. Network utilization can also be optimized in a more global way, such as the average utilization of every link in the network, or perhaps the available switching capacity across devices and links versus the capacity required to support specific business goals and/or applications.
• Application support: Applications and data are often the “real” heart of a business. No matter what kind of work a business claims to do, it actually works with information to connect buyers to sellers in some way, a process requiring data and data processing (or data analytics). A network can be optimized to support specific applications representing the primary business drivers by ensuring this set of applications always has reachability or reduced jitter and delay.
• Business advantage: The network can be optimized to increase the financial advantage of the business in some way. Specifically, reducing the cost the business pays to other companies to operate or increasing revenue by increasing user engagement or opening up new markets by connecting to new geographical locations might be ways in which the network can create opportunities to improve the business.
• Cost: How can the business build and maintain a less expensive network? This is not just a common question; it is often the only question the business that relies on the network cares about.
Any particular network will rarely be optimized for a single class among these four. Most networks will be optimized for all four of these in various parts of their topology or even at various times. Optimizations will often cross over these categories; for instance, improving support for a specific application may increase business advantage by allowing information to be applied to a specific area of the business more quickly, while also saving costs by reducing the application’s downtime.
Control plane policy is not exempt from the “choose two of three”—state, surface, and optimization—complexity tradeoff described in Chapter 1, “Fundamental Concepts.” What are the tradeoffs involved in the examples given in the first part of this chapter? Each use case will be covered in the sections that follow.
In the first use case, various policy mechanisms are used to manage where traffic exits an AS and, to some degree, where traffic enters an AS. It is easy to overlook the complexity impacts of the attributes carried in BGP, as they are actually a part of BGP. How can simply using something built into the protocol have an impact from a complexity perspective?
First, the protocol itself must be more complex in order to support the attributes being carried, and implementations build, test, and maintain the code required to process these attributes. This might seem very minor, but consider the case of BGP update packing. Figure 17-6 illustrates two sets of BGP packets.
The upper pair of packets in Figure 17-6, labeled A, shows two different destinations carried in the BGP format; there is a set of attributes (in this example only one attribute is shown, the LOCAL_PREF) and a reachable prefix. While the reachable destination is different in the two packets, the LOCAL_PREF, or rather the set of attributes, is the same. Hence, when actually advertising these two destinations, BGP can pack the two prefixes into a single update. To do this, the two prefixes are simply combined into a single update with the single set of attributes.
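The packing logic can be sketched in a few lines: group prefixes by their attribute set, and emit one update per distinct set. This is a simplified model, not the BGP wire format; the prefixes and attribute values below are hypothetical.

```python
from collections import defaultdict

def pack_updates(routes):
    """Group reachable prefixes by their attribute set so each
    distinct set of attributes is advertised in a single update."""
    groups = defaultdict(list)
    for prefix, attributes in routes:
        # Attributes must be hashable; a sorted tuple of pairs works.
        groups[tuple(sorted(attributes.items()))].append(prefix)
    # One update per distinct attribute set, carrying all its prefixes.
    return [{"attributes": dict(attrs), "nlri": prefixes}
            for attrs, prefixes in groups.items()]

routes = [
    ("2001:db8:1::/48", {"LOCAL_PREF": 200}),  # packet A, first prefix
    ("2001:db8:2::/48", {"LOCAL_PREF": 200}),  # packet A, same attributes
    ("2001:db8:3::/48", {"LOCAL_PREF": 100}),  # packet B, different attributes
]
updates = pack_updates(routes)
# Two prefixes share an attribute set, so three routes pack into two updates.
print(len(updates))  # 2
```

Grouping by the full attribute set, rather than any single attribute, mirrors the rule in the text: only routes whose attributes match exactly can share an update.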
The lower pair of packets in Figure 17-6, labeled B, is two different destinations carried in the BGP format. In this case, the reachable destinations and the attributes are different, so they cannot be combined into a single BGP update.
Packing updates, as shown with packet A, represents a major space saving when transmitting reachability information through a network in BGP. While the savings will vary between networks, it is not surprising for efficient packing to reduce initial convergence time and the number of packets sent by somewhere around 80%.
The goal of adding the policy information was to improve the utilization of the network, or rather to move traffic to maximize revenue and minimize expenses. The decision to add this policy information per route essentially trades state for optimization. More state is injected into the control plane, both in terms of the actual amount of state and the efficiency of carrying the state across the network, so the state versus optimization tradeoff holds true.
What about interaction surfaces? There are two places where this solution interacts with other systems in the network. First, the policy marker needs to be set on the correct routes, and associated with some action someplace else in the network. Largely, these settings are going to be made by a pair of human hands adding the right configuration commands to instruct BGP to set and react to these route markers. The interaction surface between people and the network is often the most difficult surface to manage. In the case of the Multiple Exit Discriminator (MED) and communities sent outside the AS, there is an interaction surface with the neighboring autonomous systems.
Routing policy, in this case, definitely fits within the State/Optimization/Surface (SOS) model; increasing network utilization requires an increase in state and surfaces.
In resource segmentation, it would seem that by splitting the reachability and topology state out of the underlay topology, the amount of state in the underlay topology has been reduced, and hence the overall complexity has been reduced. At the same time, the network appears to be more closely aligned with the business requirements, so it looks like optimization has increased while state has decreased. This seems to go against the complexity model; if the network is becoming more optimized, state should be increasing.
Welcome to the world of abstraction; this is one of those cases where you must consider things more closely to really understand the impact on complexity. Remember: if you have not found the tradeoffs, you have not looked hard enough. What is happening here is that there are now three different control planes with less information about the overall topology; the total state in the system has increased, as there are three pieces of state about a subset of the links (one for the underlying physical topology and one for each overlay virtual topology). There is definitely a larger amount of state; it is just more “spread around.”
Further, there are now three control planes running in the network; there is definitely an interaction surface to consider between the protocols, even if the protocols do not carry the same reachability information over the same links. Figure 17-7 illustrates.
Figure 17-7 represents the same network topology shown in Figure 17-2, with the virtual overlay topologies collapsed into a single diagram. The physical topology is represented by the solid gray lines; the first overlay is represented by the black dashed line; and the second overlay is represented by the black lines with intermixed dots and dashes. When the physical and overlay topologies are illustrated in this way, it is easy to see the single [G,K] link is shared across all three topologies. If the [G,K] link fails, both of the overlay topologies will also fail; this is called fate sharing. The set of links shared between more than one topology is called the Shared Risk Link Group (SRLG).
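Finding SRLGs across layers reduces to counting how many topologies use each physical link. A minimal sketch, with hypothetical link sets loosely modeled on Figure 17-7 (both virtual topologies happen to ride over the [G,K] physical link):

```python
from collections import Counter

def find_srlgs(topologies):
    """Links used by more than one topology form Shared Risk Link Groups:
    a single physical failure takes down every topology riding on them."""
    counts = Counter(link for links in topologies.values() for link in links)
    return {link for link, n in counts.items() if n > 1}

# Undirected links as frozensets, so (G,K) and (K,G) are the same link.
link = lambda a, b: frozenset((a, b))
overlays = {
    "overlay-1": {link("C", "E"), link("E", "F"), link("F", "K"), link("K", "G")},
    "overlay-2": {link("C", "D"), link("D", "G"), link("G", "K")},
}
print(sorted(find_srlgs(overlays).pop()))  # ['G', 'K']
```

A real tool would map each overlay link onto the physical links it actually traverses before counting; here the mapping is assumed to be given.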
These three control planes may not initially appear to interact with one another. They do, however, interact at the [G,K] link. Even if there is no actual link failure, any failure in the underlay control plane will prevent both virtual topologies from forwarding traffic between their respective sources and destinations. There is an interaction between the three control planes even though they do not redistribute information between themselves.
What is perhaps worse, however, is that in a two-connected network such as the one shown in Figure 17-7, there should always be two paths between any two points in the network. A single link failure should not cause H to become unreachable from A. Because the virtual topology [C,E,F,K,G] is not two-connected, however, the network has been converted to a design where a single link failure at the physical layer can cause both virtual topologies to become disconnected.
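Two-connectedness can be checked by brute force: remove each link in turn and verify the topology stays connected. A small sketch, with node and link names drawn from the chain overlay above (the extra [C,G] link is a hypothetical repair, not part of the figure):

```python
def connected(nodes, edges):
    """Depth-first connectivity check over an undirected edge list."""
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == set(nodes)

def is_two_connected(nodes, edges):
    """A topology survives any single link failure iff removing each
    edge in turn leaves it connected (i.e., it has no bridges)."""
    return all(connected(nodes, [e for e in edges if e != cut])
               for cut in edges)

nodes = {"C", "E", "F", "K", "G"}
chain = [("C", "E"), ("E", "F"), ("F", "K"), ("K", "G")]
print(is_two_connected(nodes, chain))            # False: every link is a bridge
print(is_two_connected(nodes, chain + [("C", "G")]))  # True: a ring has no bridges
```

Production tools would use a linear-time bridge-finding algorithm instead of this O(E^2) check, but the brute-force version makes the definition itself visible.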
This sort of shared fate interaction surface is often easy to miss when designing a network with overlays. The abstraction removes details, making it easier to “see” each topology separately, and reducing the state contained in each control plane, but it also hides failure risks and modes that did not exist before virtualization was deployed.
Often the only way to solve this type of problem is to add state back into the three control planes. For instance, in the network in Figure 17-7, there are a number of ways state could be added back into the network to provide alternate paths in the case of the [G,K] link failure. For instance:
• Some outside process could calculate the topologies and then reach across the layers to find SRLGs, or fate sharing points in the network. In this case, yet another control plane needs to “ride on top” to at least alert the designer about the SRLGs, so the network design can be modified to work around them. This solution adds (in effect) a fourth control plane that must interact with the other three, including any state carried in the fourth control plane.
• The two virtual topologies could be configured to overlay the entire physical topology, and some form of metric weights be placed on the link costs for each overlay topology so traffic passes along the correct path. This adds the state of carrying the entire control plane back to both topologies and potentially adds more points at which the multiple control planes will interact. Further, the overlay link metrics must be computed, configured, and managed.
• Secondary virtual overlays could be designed and deployed on the network so each topology has a prebuilt backup topology. Multiprotocol Label Switching (MPLS) Traffic Engineering (TE) Fast Reroute (FRR) provides this type of solution. To deploy this kind of solution, additional state for the backup path and switching state at the tunnel headend to quickly switch to the backup path must be added; the additional potential interaction surfaces between the operators and the network, the various control planes now running in the network, and even the various switching paths available at each device, all add complexity back into the network.
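The third option, precomputing a backup disjoint from the primary, can be sketched as a search that simply refuses to use the primary path's links. This is only a toy model of what a system like MPLS TE FRR precomputes; the adjacency table and link names are hypothetical, loosely following Figure 17-7.

```python
from collections import deque

def path_avoiding(adj, src, dst, banned):
    """Breadth-first search for a path from src to dst that uses no
    link in `banned` -- a sketch of precomputing a link-disjoint backup."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbor in adj.get(node, ()):
            if frozenset((node, neighbor)) in banned or neighbor in prev:
                continue
            prev[neighbor] = node
            queue.append(neighbor)
    return None  # no disjoint backup exists

# Hypothetical topology with two paths between C and G.
adj = {"C": ["E", "D"], "E": ["C", "F"], "F": ["E", "K"],
       "K": ["F", "G"], "D": ["C", "G"], "G": ["K", "D"]}
primary = [("C", "E"), ("E", "F"), ("F", "K"), ("K", "G")]
backup = path_avoiding(adj, "C", "G", {frozenset(l) for l in primary})
print(backup)  # ['C', 'D', 'G']
```

Note that if the physical topology is not two-connected, the function returns `None`: no amount of overlay design can manufacture a disjoint backup that the underlay does not physically provide.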
There is, in the end, no such thing as a free lunch. Network segmentation is often an effective way to provide separation between customers and workloads—it is often the only way to go, given application and security requirements—but there will always be added complexity someplace in such a design.
What are the gains, from a complexity perspective, in the flow pinning example? The primary point of flow pinning is, of course, to optimize the performance of both the elephant and mouse flow applications. The network may also operate more efficiently, at least from a switching perspective, and QoS settings may well be simpler with the two kinds of flows separated. So there is an increase in optimization, and potentially a decrease in state and interaction surfaces (due to the simpler QoS configurations and processing).
To get these improvements, there must be a corresponding increase in complexity someplace else. In this case, the increase in complexity is in control plane state. The elephant flow must somehow be pinned to a specific link, and the mouse flows must somehow be removed from the link to which the elephant flow has been pinned. There must also be some “backup plan” in case the path to which the elephant flow is pinned fails.
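That extra control plane state can be modeled as a per-flow table: elephants pinned to one link, mice steered onto another, and a backup link covering failure of the pinned link. A minimal sketch; the flow tuples and link names are hypothetical.

```python
def build_pinning_state(flows, pinned_link, default_link, backup_link, link_up):
    """Sketch of the extra state flow pinning requires: elephant flows
    pin to one link (falling back to a backup on failure), and mouse
    flows are kept off the elephant's link."""
    table = {}
    for flow, kind in flows.items():
        if kind == "elephant":
            table[flow] = pinned_link if link_up(pinned_link) else backup_link
        else:
            table[flow] = default_link  # steer mice away from the pinned link
    return table

# Hypothetical (src, dst, port) flow tuples.
flows = {
    ("10.1.1.1", "10.2.2.2", 443): "elephant",
    ("10.1.1.9", "10.2.2.7", 53):  "mouse",
}
state = build_pinning_state(flows, "link-A", "link-B", "link-C", lambda l: True)
print(state[("10.1.1.1", "10.2.2.2", 443)])  # link-A
```

Every entry in this table is state a plain shortest-path control plane would not carry; that is exactly the tradeoff the text describes.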
Control plane policy is often hiding in plain sight in the form of segmentation, flow pinning, traffic engineering, and other forms. Generally, it will take the form of directing traffic away from the shortest path, and onto a path that might appear less than optimal from a lowest metric or hop count perspective, but is more optimal in some other way, such as application support. In fact, this is as good of a definition of control plane policy as any other:
Control plane policy is any modification to the path that packets take through a network off the shortest path in order to implement some specific business or application requirement.
There is one more form of control plane policy not considered here: the aggregation and summarization of control plane information in order to reduce state and divide (or create) failure domains.
Using the control plane to implement policy presents network engineers with a set of tradeoffs. Distributed control planes, as considered in the previous chapters in this book, often become very complex when they are tasked with discovering topology, providing reachability information, and carrying policy. The next chapter explores some alternative ways to solve these problems by centralizing all or part of the functions of the control plane.
These sources consider a number of other interesting and useful control plane policies not discussed in this chapter and provide more information on the policies that were discussed.
Agarwal, Sharad, A. Nucci, and Supratik Bhattacharyya. “Measuring the Shared Fate of IGP Engineering and Interdomain Traffic.” In 13th IEEE International Conference on Network Protocols (ICNP’05), October 2005. doi:10.1109/ICNP.2005.22.
Casado, Martin, and Justin Pettit. “Of Mice and Elephants.” Network Heresy, November 1, 2013. https://networkheresy.com/2013/11/01/of-mice-and-elephants/.
Das, VV“分布式拒绝服务蜜罐方案”,497–501,2009。doi:10.1109/ICACC.2009.146。
Das, V. V. “Honeypot Scheme for Distributed Denial-of-Service,” 497–501, 2009. doi:10.1109/ICACC.2009.146.
Hinrichs, Tim, and Scott Lowe. “On Policy in the Data Center: The Policy Problem.” Network Heresy, April 22, 2014. https://networkheresy.com/2014/04/22/on-policy-in-the-data-center-the-policy-problem/.
Justin Pettit, Kanna Rajagopal, and J. R. Rivers. “Elephant Detection in the vSwitch with Performance Handling in the Underlay.” Network Heresy, May 16, 2014. https://networkheresy.com/2014/05/16/elephant-detection-in-the-vswitch-with-performance-handling-in-the-underlay/.
Karp, Brad Nelson. “Geographic Routing for Wireless Networks.” Harvard University, 2000. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.5738&rep=rep1&type=pdf.
Kaur, Harminder, Harsukhpreet Singh, and Anurag Sharma. “Geographic Routing Protocol: A Review.” International Journal of Grid and Distributed Computing 9, no. 2 (2016): 254.
McPherson, Danny R., and Keyur Patel. Experience with the BGP-4 Protocol. Request for Comments 4277. RFC Editor, 2006. https://rfc-editor.org/rfc/rfc4277.txt.
Psounis, K., A. Ghosh, B. Prabhakar, and G. Wang. “SIFT: A Simple Algorithm for Tracking Elephant Flows, and Taking Advantage of Power Laws.” In Proceedings of the 43rd Allerton Conference on Communication, Control and Computing, 2005. https://web.stanford.edu/~balaji/papers/05sifta.pdf.
Teixeira, Renata. “Hot Potatoes Heat Up BGP Routing.” Presented at the RIPE, Amsterdam, October 2005. https://meetings.ripe.net/ripe-51/presentations/pdf/ripe51-hot-potatoes.pdf.
Weiler, Nathalie. “Honeypots for Distributed Denial of Service Attacks.” IEEE, 2002. http://www.csl.mtu.edu/cs6461/www/Reading/Weiler02.pdf.
White, Russ. “Elephant Flows, Fabrics, and I2RS.” Rule 11 Reader, October 3, 2016. http://rule11.tech/i2rs-elephant-flows/.
1. The chapter mentions four broad classes of optimizations that operators may optimize a network for. Think through the use case examples given in the chapter, and explain which of these four classes of optimization each one could fit into and why. Find three other use cases (need to find sources) and explain which of the four classes of optimization each of them could fit into and why. Are there use cases that do not fit into one of these four broad categories? What category would you add to cover all the use cases?
2. Besides hot and cold potato routing, there is also “mashed potato routing.” What is the definition of mashed potato routing, and what is it used for?
3. Elephant and mouse flows are described in the chapter as being large, persistent flows, and short-duration small-volume flows. Name three applications that would generate each kind of flow.
4. In Figure 17-3, packet classification must be configured at both C and G to support differentiating traffic flowing between A and H. Why would these need to be configured in both places? Would the filters be the same? If not, how would they be different?
As much as it might seem otherwise, the information technology field is strongly driven by egos and fashion. What was “in” last year will be “out” this year, often with very little reason other than “this is new, and that is old.” Network engineering is no different in this regard. For instance, network designs have swung between an “ideal” of decentralized control planes to centralized control planes a number of times over their history. Which is truly better?
The best way to cut through these pendulum swings, and their attendant hype factor, is to be able to understand the underlying problems, the underlying solutions, and the tradeoffs between the various solutions (as well as whether the problem at hand even needs to be solved at all—a point far too many designers and architects tend to forget). Toward this end, this chapter will begin by explaining a taxonomy of centralized control planes, developed on the model of a forwarding device.
With this model in hand, several specific examples will be surveyed. Each of these systems will be placed into the framework of the problems a control plane needs to solve, and the tradeoffs against distributed control planes, examples of which were considered in the preceding two chapters, will be examined. These solutions all fit into the roughly defined categories of a Software-Defined Network (SDN) or Programmable Network (PN).
SDN is often presented as an either/or proposition: either you build a network using distributed protocols, or you build a network with a centralized control plane. The nebulous nature of the term SDN contributes to this way of seeing software defined. Specifically, how is an implementation of Open Shortest Path First (OSPF), for instance, not software defined? The general idea seems to be this: in hardware-based networks the software is embedded in appliances; the software and hardware are purchased, configured, and managed as one “thing.” In software defined, the software is separate from the hardware. Hence, in the SDN model, the software is seen as separate from the appliance; in the distributed model, the software has traditionally been seen as part of (or embedded in) the hardware. This brings us to a second false either/or divide. In reality, every distributed control plane is implemented in software, and hence can always be separated from the hardware.
Given software can always be separated from hardware, how can SDN be differentiated from the “traditional” model? The primary idea revolves around separating some part of the functionality of the control plane from the individual forwarding devices, or rather, pulling some part of the functionality of the control plane into a “centralized” control plane. It is important to note the term centralized control plane, as it is used here, does not mean a single “god box” that controls the network. Rather, it simply means a control plane that, in some way, does not run entirely on the network devices.
The SDN and PN worlds, in many ways, have their own terminology; the most important are the southbound interface and the northbound interface:
• The southbound interface is the interface between the controller and the network devices.
• The northbound interface is the interface between the controller and applications (or business logic).
Within the realm of southbound interfaces, there are a number of different interaction points, or ways in which the controller interacts with the forwarding devices.
Figure 18-1 shows four different control plane models:
• In the distributed model, the control plane software runs primarily on forwarding devices. This does not mean the control plane software is embedded in the forwarding device. In an appliance model, the software is treated as an embedded part of the appliance itself. In a disaggregated model, however, the software runs primarily on the forwarding device, but the software is clearly delineated from the forwarding hardware.
• In the augmented model, the control plane software runs primarily on forwarding devices. Like the distributed model, the control plane is not necessarily embedded in the forwarding device. In the augmented model, the local control plane processes interact with the routing table (Routing Information Base, or RIB). Off-box processes interact with the distributed control plane to influence the set of loop-free paths installed in the RIB.
• In the hybrid model, the centralized component of the control plane runs in parallel with the distributed control plane. From the distributed control plane process running on each device, the controller just appears to be another distributed control plane running in parallel (in effect). From the controller’s perspective, much the same is true; the distributed control plane is just another control plane running in parallel with the controller.
• In the replace model, there is no distributed control plane; the centralized control plane is the only source of loop-free paths for the local switching device. One key marker of implementations using this model is the controller speaks directly to the forwarding table (FIB) rather than the RIB.
Within this framework, it is important to ask which part of the control plane—or more specifically, which functionality—is placed in the distributed control plane and which is placed in the centralized control plane.
Which part of the control plane is centralized is the crucial question when considering SDNs and PNs. What are the parts of a control plane?
There are three “things” a control plane must provide to support applications and businesses: topology information, reachability information, and policy. Almost every control plane implemented since the beginning of network engineering time has assumed these three functions are part of a single “thing,” and hence they must all be done in a single protocol.
Just as data planes are layered by function and location, however, it makes more sense to consider the control plane as a set of functions that can be split into layers. What would these layers look like?
• Discovering topology and advertising reachability are inseparable in some protocols, such as the Routing Information Protocol (RIP) and the Enhanced Interior Gateway Routing Protocol (EIGRP). In other protocols, such as Intermediate System to Intermediate System (IS-IS), the Shortest Path Tree (SPT) is calculated based on the topology, and reachability is “hung off the tree” as leaf nodes. Conceptually, then, it is difficult to see how topology and reachability could be separated into layers; the interplay between the two pieces of information is direct and immediate.
• Policy, on the other hand, relies on the topology and reachability information, but otherwise does not interact with topology and reachability. In fact, control plane policy is generally the process of overriding the basic topology and reachability information calculated by the control plane.
There appears to be a natural “split,” with topology and reachability on one side of the divide, and policy on the other side of the divide. It is possible, then, to break the control plane into two “layers,” with the bottom layer providing topology and reachability information, and the upper layer providing modifications to the paths calculated by the lower layer in order to implement specific policies.
While the Border Gateway Protocol (BGP) was originally designed to interconnect networks operated by different companies—particularly transit service provider networks—providers with large-scale data centers realized it could be used to scale spine and leaf fabrics. Figure 18-2 illustrates BGP as used in a data center.
Figure 18-2 shows a five-stage spine and leaf fabric using eBGP as a control plane; as there are no “cross links” in a spine and leaf, there is no iBGP between routers 5a and 5b (using the row and column identifiers to label routers). Rows 1 and 5 are Top of Rack (ToR) devices, connected to servers hosting the applications using the fabric.
To provide the example, assume some flow should be pinned between 5b and 1d. It is always possible to manually configure each router in the network with static routes to pin this one flow to a specific path, but this creates a lot of opportunities for configuration mistakes.
Of course, you could always automate the configuration. But automation does not really reduce the amount of complexity; it just relocates the complexity from the human-to-device interface into a three-layer structure: human to automation system, automation system to device. In other words, automating a complex configuration does not make the configuration less complex; it just makes the complexity less apparent. There is no doubt this can sometimes be a good thing, but there is also no doubt automating a bad process does not improve the process. Automation can solve many things, but network engineers need to be careful in thinking automated configurations will “solve all problems.”
Another option, particularly since BGP is already running on every router in the network, is to use BGP as an SDN. Toward this end, an iBGP controller, shown at the bottom of the diagram, is connected to every router in the fabric.
Note
Only a small number of the iBGP connections are shown so the illustration remains readable.
Once the iBGP sessions are in place, the controller can “read” the entire topology and use local policies to determine which path the flow should be pinned to, and also which flows need to avoid the path over which the pinned flow is passing. For instance, assume the flow should be pinned to the [5b,4a,3c,2b,1d] path. A lower-cost path toward the destination (behind 1d) through 4a can be injected at 5b, and again through 3c at 4a, and again through 2b at 3c, etc., until the best path at each router along the path is through the selected path. The easiest way to accomplish this in BGP would be to inject a route from the controller with a higher local preference, but there are many ways to express such a policy in BGP.
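The per-hop injection described above can be sketched in a few lines; this is a minimal illustration rather than any real BGP implementation's API, and the route fields and default local preference of 100 are assumptions:

```python
# Sketch: the routes a controller would inject, hop by hop, to pin a flow
# along an explicit path. The dictionary structure and the default local
# preference value are illustrative, not a real BGP API.

DEFAULT_LOCAL_PREF = 100  # assumed default on every router

def pin_path(path, destination, local_pref=200):
    """Return, per router on `path`, the route to inject so the best path
    toward `destination` points at the next router in `path`."""
    injected = {}
    for router, next_hop in zip(path, path[1:]):
        injected[router] = {
            "prefix": destination,
            "next_hop": next_hop,
            # Preferred over normally learned routes, so it wins the
            # best-path selection at this router.
            "local_pref": local_pref,
        }
    return injected

routes = pin_path(["5b", "4a", "3c", "2b", "1d"], "prefix behind 1d")
```

The last router on the path (1d) needs no injected route, since the destination is directly attached there.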
This is an example of an augmented model; the centralized part of the control plane interacts with the distributed control plane (eBGP) directly. This is a rather interesting version of a hybrid model implementation, however, in that the protocol used to push policy (the southbound interface) is the same as the protocol used to discover and distribute topology and reachability information.
Link state protocols, unlike BGP, are focused on finding the shortest path to a given destination; while most implementations do support tags that can be carried in the protocol, these tags are rarely (actually never) used to modify traffic flow. The reason for this is fairly simple: the link state database must be synchronized among all routers. If two routers have a different view of the network topology, it is possible they will compute a looped path through the network.
Fibbing works within this set of constraints to allow traffic-engineered paths to be computed without modifying the link state protocol, such as Open Shortest Path First (OSPF) or Intermediate System to Intermediate System (IS-IS). Essentially, fibbing works by inserting false nodes, similar to pseudonodes, into the link state database, causing OSPF and IS-IS to change the shortest path, and hence engineering traffic flows through the network.
Note
This technique requires the route type used to create these fake nodes be able to carry a third-party next hop; v1, for instance, must be able to set the next hop for h1, which has the same address as H, to D, rather than to the fake node itself. Among link state protocols, as of this writing, only one kind of route can carry a third-party next hop: the OSPF external route. This means destinations for which traffic is engineered using fibbing must be external routes, and the fake nodes and other information the controller injects must also be OSPF external routes.
Figure 18-3 illustrates one possible way in which such fake nodes can be inserted into the network to modify traffic flow.
Figure 18-3 illustrates three stages in the same network: the first stage is the network without fibbing, the second is with fibbing nodes included to alter the best path chosen by OSPF, and the third is after the fibbing nodes have been optimized.
In 1, the top network illustrated in Figure 18-3, OSPF would choose the best path from A to H along [A,B,C,F,H], as this path has a total cost of 40. The next shortest path is through [B,D,E,F] or [B,D,E,G], both of which have a cost of 50. The policy to be applied to this network is to force the A to H traffic along the path [A,B,D,E,G,H]. The first step is adding a controller that can consume the link state database by participating in OSPF, and can also inject new LSAs into the network. This controller is attached to F and is labeled K in the diagram.
To put the policy in place, the controller must convince
• B that the shortest path toward H passes through D.
• D that the shortest path toward H passes through E.
• E that the shortest path toward H passes through G.
To do this, the controller can inject three LSAs for fake nodes into the network, v1, v2, and v3, each of which advertises the destination H as directly connected (shown as h1, h2, and h3 on the diagram):
• The advertisement for h1 from v1 has D set as the next hop, so that if B chooses this path toward H, the traffic is forwarded to D rather than v1.
• The advertisement for h2 from v2 has E set as the next hop, so that if B chooses this path toward H, the traffic is forwarded to E rather than v2.
• The advertisement for h3 from v3 has G set as the next hop, so that if B chooses this path toward H, the traffic is forwarded to G rather than v3.
The controller must also advertise some new links—specifically:
• [B,v1] with some cost lower than 40
• [v1,B] with an infinite cost
• [v1,D] with any cost
• [D,v1] with an infinite cost
• [D,v2] with any cost less than 30
• [v2,D] with an infinite cost
• [E,v2] with any cost
• [v2,E] with an infinite cost
• [E,v3] with any cost less than 20
• [v3,E] with an infinite cost
• [v3,G] with any cost
• [G,v3] with an infinite cost
Given this set of nodes and links:
• B will compute the path to H through v1, and forward the traffic toward H to D (because the next hop advertised by v1 to h1 is through D).
• D will compute the path to H through v2, and forward the traffic toward H to E (because the next hop advertised by v2 to h2 is through E).
• E will compute the path to H through v3, and forward the traffic toward H to G (because the next hop advertised by v3 to h3 is through G).
These alternate best paths, then, will carry the traffic along the path [A,B,v1(to D),v2(to E),v3(to G),H]. Adding a node per hop might seem inefficient; hence the fibbing process includes an optimization step. At each hop along the calculated path, the algorithm can compute where each router would forward traffic anyway. In the case of B, the shortest path is normally through C, so the fake node is required to redirect the traffic. In the case of D, however, the shortest path is normally through E, which is the correct path; the fake node does not need to be created to convince D to forward traffic toward H through E. In the case of E, there are two equal cost paths; the fake node would be needed to force E to choose the correct path of the two. The final network, 3, illustrated in Figure 18-3, shows the optimized set of fake nodes inserted in the network.
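The optimization step just described can be captured as a small routine; the per-router best next hops below are taken from the Figure 18-3 discussion, but the function shape is an assumption for illustration, not the published fibbing algorithm:

```python
# Sketch of the fibbing optimization step: keep a fake node only where the
# router's existing shortest path(s) toward H do not already force the
# desired next hop. `best_next_hops` is assumed to come from a normal SPF run.

def needed_fake_nodes(desired_path, best_next_hops):
    fakes = []
    for router, wanted in zip(desired_path, desired_path[1:]):
        existing = best_next_hops.get(router, set())
        # A fake node is unnecessary only if the router already has exactly
        # one best next hop, and it is the desired one.
        if existing != {wanted}:
            fakes.append(router)
    return fakes

# Values taken from the Figure 18-3 discussion:
best_next_hops = {
    "B": {"C"},       # normally prefers C; must be redirected to D
    "D": {"E"},       # already forwards through E; no fake node needed
    "E": {"F", "G"},  # two equal-cost paths; the tie must be broken toward G
}
fakes = needed_fake_nodes(["B", "D", "E", "G"], best_next_hops)
```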
Work on the Interface to the Routing System (I2RS) began in the Internet Engineering Task Force (IETF) in 2012. The original charter was to build an interface into the RIB, acting as a channel between the RIB and an off-device process or application. To quote directly from the problem statement, RFC 7920:
Traditionally, routing systems have implemented routing and signaling (e.g., Multiprotocol Label Switching, or MPLS) to control traffic forwarding in a network. Route computation has been controlled by relatively static policies that define link cost, route cost, or import and export routing policies. Requirements have emerged to more dynamically manage and program routing systems due to the advent of highly dynamic data-center networking, on-demand Wide Area Network (WAN) services, dynamic policy-driven traffic steering and service chaining, the need for real-time security threat responsiveness via traffic control, and a paradigm of separating policy-based decision-making from the router itself. These requirements should allow controlling routing information and traffic paths and extracting network topology information, traffic statistics, and other network analytics from routing systems.1
Figure 18-4 illustrates the architecture of I2RS.
In Figure 18-4, there are several critical components:
• The application, which is normally some sort of network-level orchestration package providing a “business policy” or “intent-focused” interface to the user. This application is responsible for translating intent into some form of input that the I2RS controller can understand, or translating the information that the I2RS controller provides into some form of human-readable information (such as an overlay view of all the topologies currently enabled on the network).
• The northbound Application Programming Interface (API), which is not defined by the I2RS specifications.
• The I2RS controller, which is a package executing on a server someplace (virtual or otherwise). This translates the intent and “human readable” requests from the application into the format of the southbound API.
• The southbound API, which is YANG-modeled data carried over one of several different transport mechanisms.
• The I2RS agent, which does two things:
• Translates the YANG-modeled data into local RIB API calls to install, remove, and modify routes.
• Translates local RIB and routing information into YANG models describing the topology of the network (including overlay topologies).
It is possible to run only an I2RS agent on each router, replacing the distributed control plane completely. In this case, the controller would take the connected interface and destination information from the RIB, possibly using information from other protocols (such as the Link Layer Discovery Protocol, or LLDP) to verify adjacent connected routers, to build a complete view of the network. Based on this information, the controller could use any one of the loop-free path calculation mechanisms to calculate a set of loop-free routes through the network, modify them based on the policy being fed to the controller by the application(s), and then distribute the resulting routes to the RIB at each router. Deploying I2RS in this way would be an example of the replace model discussed in the first section of this chapter.
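A minimal sketch of the replace-model computation, assuming the controller has already assembled the topology from the routers' RIBs and LLDP; the four-router graph, the costs, and the route structure are hypothetical:

```python
# Sketch of the replace model: the controller, holding the full topology,
# runs a loop-free (shortest path) calculation and derives the routes it
# would push into each router's RIB through the I2RS agent.
import heapq

def shortest_path_tree(graph, source):
    """Plain Dijkstra; returns cost and predecessor for each reachable node."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    return dist, prev

def first_hop(prev, source, dest):
    """Walk predecessors back from dest to find source's next hop."""
    node = dest
    while prev[node] != source:
        node = prev[node]
    return node

# Hypothetical four-router topology with symmetric costs.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 5},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, prev = shortest_path_tree(graph, "A")
# The route the controller would install at A for destination D:
route = {"prefix": "D", "next_hop": first_hop(prev, "A", "D"), "metric": dist["D"]}
```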
I2RS is not designed to be deployed in replace mode, however. The I2RS agent, which interfaces with the RIB in the same way any distributed routing protocol running on the device does, allows I2RS to act in parallel with other control planes; this would fall under the hybrid mode considered in the first part of this chapter. Figure 18-5 illustrates.
In Figure 18-5, A and H are both sending large streams of data to two different services residing on K. The shortest path, calculated by the routing protocol, from A to K is along the path [B,E,G]; the shortest path calculated by the routing protocol from H to K is [E,G]. If both of these flows are placed on the [E,G] link, it could overwhelm the link, so the network operator would like to move A’s traffic to an alternate path. This might be expressed as a policy something like “the differential between the utilization of any two paths in the network should not be more than 20%,” or something similar.
The controller, C, can then monitor each link in the network; when both A and H send traffic, the controller can note the [E,G] link is out of policy, and hence look for some alternate path over which to send some part of the traffic. The obvious choice will be the traffic originating at A; what is not so obvious is where to send this traffic. There are a number of options available to the controller, depending on the capabilities of each device in the network. For instance:
• If the network supports MPLS label stacks, the controller could impose a label stack on the traffic on the inbound port connecting B to A, causing the traffic to follow the path [B,E,F,G]; this would be implementing segment routing using I2RS to push the label stacks to network devices.
• If E supports forwarding based on the source and destination addresses, the controller could push a forwarding rule stating all traffic sourced from A, and destined to K, should be forwarded toward F instead of toward G; the controller would need to calculate that F will not forward the traffic back to E, of course, which would depend on the local link metrics.
• If F, for some reason, would normally use the path through E to reach K, the controller can set destination-based forwarding rules in B, D, F, and G to cause the traffic sourced from A, and destined to K, to follow the path [B,D,F,G].
All other traffic in the network would continue to follow the routes calculated by the distributed routing protocol running in parallel with I2RS. This means I2RS is being used in a hybrid model programmable network mode in this example. This is the operational role I2RS was designed to fill.
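The utilization policy described above is simple to sketch; the path names and utilization figures below are hypothetical, and a real controller would gather these readings from telemetry:

```python
# Sketch of the controller's policy check: flag any pair of paths whose
# utilizations differ by more than the allowed differential (20 points).

def out_of_policy(path_utilization, max_differential=20):
    """path_utilization: {path_name: percent_utilized}. Returns the pairs
    of paths violating the differential policy."""
    violations = []
    paths = sorted(path_utilization)
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if abs(path_utilization[a] - path_utilization[b]) > max_differential:
                violations.append((a, b))
    return violations

# Hypothetical readings: the [E,G] path is carrying both large flows.
readings = {"[E,G]": 90, "[E,F,G]": 35, "[D,F,G]": 30}
violations = out_of_policy(readings)
```

Once a violation is detected, the controller would apply one of the steering options listed above to move part of the traffic off the overloaded path.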
I2RS uses the YANG modeling language to describe forwarding and topology information. For instance, a route is modeled as a set of objects, as shown in Figure 18-6.
The three kinds of objects in a route model shown in Figure 18-6 are as follows:
• Route attributes, such as the metric.
• The route match, which is the portion of the route that is matched to the destination address; when being processed, the destination of the packet can be matched on an IPv4 address, an IPv6 address, an MPLS label, a Media Access Control (MAC) address, or an interface.
• When the route and the attributes match, the packet is sent to what is contained in the next hop field.
Why not define this in a single structure, rather than as a set of related objects? After all, this sort of structure appears to make the model of a single route more complex. The advantage here, however, is the same as the advantages of encoding information into a Type Length Value (TLV); it is very easy to extend the model if some new kind of match is needed, some new attribute is needed, or some new kind of next hop is needed. One specific example is the idea of an equal cost multipath (ECMP) group. The next hop object can be a single next hop, or a collection of next hops in the form of an ECMP group, or even, perhaps, a next hop and a fast reroute next hop (an alternate next hop).
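The extensibility argument can be illustrated with a few object definitions; the class and field names here are illustrative, not taken from the I2RS YANG models:

```python
# Sketch of the route-as-objects idea: because the match, the attributes,
# and the next hop are separate objects, a new next hop kind (here, an ECMP
# group) can be added without touching the rest of the model.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class NextHop:
    interface: str
    address: str

@dataclass
class EcmpGroup:
    members: List[NextHop]

@dataclass
class Route:
    match: str                       # e.g., an IPv4 prefix, MPLS label, or MAC
    next_hop: Union[NextHop, EcmpGroup]
    attributes: dict = field(default_factory=dict)  # e.g., {"metric": 20}

single = Route("2001:db8::/64", NextHop("eth0", "fe80::1"), {"metric": 10})
ecmp = Route("192.0.2.0/24",
             EcmpGroup([NextHop("eth0", "10.0.0.1"),
                        NextHop("eth1", "10.0.1.1")]),
             {"metric": 20})
```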
The model of each route, expressed in YANG, looks like this:
+--rw route-list* [route-index]
| +--rw route-index uint64
| +--rw match
| | +--rw (route-type)?
| | +--:(ipv4)
| | | ...
| | +--:(ipv6)
| | | ...
| | +--:(mpls-route)
| | | ...
| | +--:(mac-route)
| | | ...
| | +--:(interface-route)
| | ...
| +--rw nexthop
| | +--rw nexthop-id? uint32
| | +--rw sharing-flag? boolean
| | +--rw (nexthop-type)?
| | +--:(nexthop-base)
| | | ...
| | +--:(nexthop-chain) {nexthop-chain}?
| | | ...
| | +--:(nexthop-replicates) {nexthop-replicates}?
| | | ...
| | +--:(nexthop-protection) {nexthop-protection}?
| | | ...
| | +--:(nexthop-load-balance) {nexthop-load-balance}?
| | ...
| +--rw route-status
| | ...
| +--rw route-attributes
| | ...
| +--rw route-vendor-attributes
+--rw nexthop-list* [nexthop-member-id]
+--rw nexthop-member-id uint32
You can see each of the elements shown here in the diagram laid out in a human-readable, textual format within the YANG model.
The original Path Computation Element Protocol (PCEP) work dates from the early 2000s, with the first IETF RFC (4655) published as informational in 2006, which means PCEP predates the time when SDNs were “cool.” PCEP was created because of the increasingly complex nature of computing Traffic Engineering (TE) paths through (primarily) Service Provider (SP) networks. Three specific developments drove the design, standardization, and deployment of PCEP:
• The complexity of calculating TE paths across large, dispersed networks with a lot of different available paths
• The complexity of calculating TE paths across multiple organizations and internal network boundaries; for instance multiple flooding domains, multiple interior gateway protocols stitched together with BGP, or multiple BGP autonomous systems
• The complexity of computing TE paths through multiple levels of abstraction, such as computing an MPLS TE path on top of an optical path; this includes the difficulty of computing Shared Risk Link Groups (SRLGs) where a large set of virtual topologies cross a complex set of physical (primarily optical) links
The state necessary to compute TE paths in each of these situations is either very difficult or impossible to assemble in a single distributed control plane. All of these functions require some sort of overlay controller-based network with visibility into the entire network, including the physical through the application layers, and across administrative and failure domain boundaries.
If this set of requirements is starting to sound familiar, it should be; many of the SDN type overlays discussed in this chapter were created to solve some variant of this problem set. Figure 18-7 illustrates the components of the PCEP ecosystem.
There are four crucial components of PCEP shown in Figure 18-7:
• The PCC is the Path Computation Client; this is the application or service requesting a new TE path be configured through the network.
• The PCE is the Path Computation Element; this is the controller with the overall view of the network, and it computes the TE path through the network (normally using some form of Constrained SPF).
• The LER is the Label Edge Router; this is the head- and tailend of the TE Label Switched Path (LSP) through the network.
• The LSR is the Label Switch Router; these simply forward based on the labels as they are configured by the PCE using PCEP.
In a single network (domain or autonomous system), there may be multiple PCEs that may communicate in a number of different ways. For instance, PCEs may share topology information using a link state protocol or BGP (particularly if BGP is carrying topology information through BGP-LS). There may also be one or more PCCs. PCEP is also designed to build paths across domains or autonomous systems; a set of PCEs may communicate with one another to build a TE path across multiple provider networks, instructing local PCCs to set up the correct LSPs through each LSR along the path.
The way a TE path is normally designed in PCEP is each device is configured with a simple set of forwarding rules; any packet received with one label, say X, is forwarded out the indicated interface with a new label Y. This is exactly the same as any other MPLS technology that swaps the outer label at each hop.
PCEP, as a protocol, is highly tuned to the process of inserting the inbound label, outbound interface, and outbound label into the forwarding table at each LER and LSR. While PCEP does encode information into TLVs, there is no specific capability to insert filtering or traffic classification rules of any kind. The controller must be able to configure the LER to channel the correct traffic into the LSP headend in some way. It is possible, of course, to configure a label to be routed to the NULL0 interface, which effectively filters the packet stream, so it is possible to do some forms of packet filtering using PCEP.
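The forwarding state PCEP installs at each LSR amounts to a label swap table; a minimal sketch follows, where the interface names and label values are hypothetical:

```python
# Sketch of per-LSR forwarding state: a label swap table mapping an inbound
# label to an outbound interface and label. A label mapped to the NULL0
# interface drops the packet, giving the crude filtering described above.

def forward(swap_table, in_label):
    """Return (out_interface, out_label), or None if the packet is filtered."""
    out_interface, out_label = swap_table[in_label]
    if out_interface == "NULL0":
        return None  # effectively a packet filter
    return (out_interface, out_label)

# Hypothetical LSP through this LSR: label 100 in, label 200 out on ge-0/0/1.
swap_table = {
    100: ("ge-0/0/1", 200),
    300: ("NULL0", 0),   # a filtered flow
}
```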
PCEP falls into the hybrid model described in the first part of this chapter.
OpenFlow made SDN technology “cool.” The project began in 2006 with two sets of problems. The first was a project at Stanford built around centrally managing policy in a network. The second was a group of projects in other universities where researchers wanted to try new ways of building routing protocols; however, the hardware platforms available at the time were not something end users could modify by installing new routing code on them. These requirements breathed new life into the concept of separating the control and forwarding planes, driven by the idea of a standard protocol to carry information between the control plane and the FIB. Figure 18-8 illustrates the basic concept.
Figure 18-8 illustrates the most basic OpenFlow configuration. The switching device does not have any control plane at all, as the controller interacts directly with the FIB. OpenFlow provides a packet format that describes forwarding table entries in the FIB directly, and a protocol over which these packets can be carried. The FIB, in OpenFlow documentation, is referred to as the flow table, as it contains information about each individual flow the switch needs to know about.
Note the wording here: each individual flow. This is because OpenFlow was originally designed to operate on any and (possibly) every field in a packet header.
The controller specifies a set of bits and an offset the switch is supposed to match, and then a set of actions to take if a packet matches the specified pattern. The switch, then, can just check each packet it processes to see if it matches this pattern. The pattern might contain, for instance, the source and destination Internet Protocol (IP) addresses, the source and destination media access addresses, protocol numbers, port numbers, and just about anything contained in the packet header.
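A bits-and-offset match of this kind can be sketched as follows; the example assumes a raw IPv4 header with no encapsulation (the destination address sits at byte offset 16):

```python
# Sketch of bit-pattern matching: the controller supplies an offset, a mask,
# and a value; the switch compares the masked bytes of each packet header
# against the masked value.

def matches(packet: bytes, offset: int, mask: bytes, value: bytes) -> bool:
    """True if the bytes at `offset`, ANDed with `mask`, equal the masked value."""
    window = packet[offset:offset + len(mask)]
    if len(window) < len(mask):
        return False
    masked_window = bytes(b & m for b, m in zip(window, mask))
    masked_value = bytes(v & m for v, m in zip(value, mask))
    return masked_window == masked_value

# Match an IPv4 destination of 10.0.2.0/24 at byte offset 16 of the header.
header = bytes([0x45, 0x00]) + bytes(14) + bytes([10, 0, 2, 5])
found = matches(header, 16, b"\xff\xff\xff\x00", b"\x0a\x00\x02\x00")
```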
It is impossible to build hardware able to contain information on every flow passing through the device. It is impossible, as well, for the controller to know about every flow being initiated by every host attached to the network. To resolve these problems, OpenFlow is normally implemented as a reactive control plane. This means processing a new stream takes several steps:
1. The host starts sending packets in the new stream.
2. The first hop switch receives these packets and finds it has no flow label matching the new flow.
3. The first hop switch will send the packets to the controller.
4. The controller examines the packet, finds a matching policy (if there is one), and computes a loop-free path through the network.
5. The controller installs flow label information for this new flow in every switch through which packets in this flow will pass.
6. The switches now forward traffic normally.
Flow labels are cached, which means each flow label is held until it has not been used for some time. OpenFlow, then, was originally designed as, and is often deployed as, a reactive control plane, which means the control plane relies on information dynamically (in near real time) supplied by the data plane to build forwarding information.
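The six-step reactive sequence can be sketched as a toy switch and controller; the port names are placeholders, and the controller's policy lookup and path computation are stubbed out:

```python
# Sketch of reactive flow setup: on a flow-table miss the switch punts to
# the controller, which installs entries in every switch along the path;
# later packets in the same flow then match directly.

class Switch:
    def __init__(self, name):
        self.name = name
        self.flow_table = {}  # flow id -> output port

    def handle(self, flow, controller):
        if flow not in self.flow_table:   # steps 2-3: table miss, punt
            controller.flow_miss(flow, self)
        return self.flow_table[flow]      # step 6: normal forwarding

class Controller:
    def __init__(self, switches):
        self.switches = switches
        self.misses = 0

    def flow_miss(self, flow, first_hop):
        # Steps 4-5: policy lookup and loop-free path computation are
        # stubbed out; install an entry on every switch along the path.
        self.misses += 1
        for i, switch in enumerate(self.switches):
            switch.flow_table[flow] = f"port-to-hop-{i + 1}"

switches = [Switch("s1"), Switch("s2"), Switch("s3")]
controller = Controller(switches)
port = switches[0].handle(("10.0.0.1", "10.0.0.2", 80), controller)
```

A second packet in the same flow hits the cached entry and never reaches the controller.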
This kind of processing is generally not scalable in many environments, particularly in the environment OpenFlow is considered ideal for—hyperscale data center fabrics for building private and public clouds. Because of this, many implementations rely on wildcard flow labels, which work much like IP routes; if a subset of the information is matched, the packet is processed based on the rules given for the partial match. Much like more traditional IP routing, the partial match is often the destination subnet.
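Wildcard rules can be sketched as partial field matches; the rule format and field names here are illustrative, not the OpenFlow wire format:

```python
# Sketch of wildcard flow matching: a rule matches on a subset of header
# fields, with None meaning "wildcard." The first (most specific) matching
# rule wins, much like destination-based IP routing.

def rule_matches(rule, packet):
    return all(value is None or packet.get(fieldname) == value
               for fieldname, value in rule["match"].items())

def classify(rules, packet):
    for rule in rules:   # rules assumed sorted most-specific first
        if rule_matches(rule, packet):
            return rule["action"]
    return "punt-to-controller"

rules = [
    {"match": {"dst_ip": "10.0.2.5", "dst_port": 80}, "action": "out-port-1"},
    {"match": {"dst_ip": "10.0.2.5", "dst_port": None}, "action": "out-port-2"},
]
action = classify(rules, {"dst_ip": "10.0.2.5", "dst_port": 443})
```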
While OpenFlow is often shown with an off-device controller, this is not the only deployment pattern where OpenFlow has been used. Figure 18-9 illustrates.
In Figure 18-9, two chassis devices are represented. In each one, there is a processor (or compute engine) running a standard distributed routing protocol. This routing engine communicates with an OpenFlow controller within the device, perhaps running on the same processor. This controller then uses OpenFlow to send routes to individual line cards, each of which acts as a sort of independent switch. The entire unit might appear to be a fairly standard chassis switch, with OpenFlow being used as a sort of Interprocess Communication (IPC) system between the components. The advantage in such a design is that line cards with different sorts of processors can be used; so long as each kind of processor has an OpenFlow interface, the hardware under the controller (and within the stack or chassis) can be replaced fairly easily.
Centralization can often bring many benefits in terms of policy implementation; some engineers and researchers think centralized control planes are much simpler than distributed control planes. Why not completely centralize all control planes, then? The answer lies in another three-way tradeoff problem, much like the State/Optimization/Surface (SOS) three-way problem discussed in Chapter 1, “Fundamental Concepts.” To understand this problem, consider Figure 18-10.
There are four sets illustrated in Figure 18-10:
1. 单个服务器上的单个数据库,由运行在两台主机(C 和 D)上的两个进程访问
1. A single database on a single server, accessed by two processes running on two hosts, C and D
2. 一对包含相同信息(必须同步)的数据库运行在一台服务器上,由运行在两台主机 C 和 D 上的两个进程访问
2. A pair of databases containing the same information (which must be synchronized) running on a single server, accessed by two processes running on two hosts, C and D
3. 一对包含相同信息(必须同步)的数据库运行在通过单线连接的一对服务器上,由运行在两台主机 C 和 D 上的两个进程访问
3. A pair of databases containing the same information (which must be synchronized) running on a pair of servers connected by a single wire, accessed by two processes running on two hosts, C and D
4. 一对包含相同信息(必须同步)的数据库运行在通过路由器连接的一对服务器上,由运行在两台主机 C 和 D 上的两个进程访问
4. A pair of databases containing the same information (which must be synchronized) running on a pair of servers connected through a router, accessed by two processes running on two hosts, C and D
现在考虑一下如果 C 写入一些信息并且 D 立即读取它,在每种情况下会发生什么:
Now consider what happens in each case if C writes some piece of information and D immediately reads it:
1、信息写入数据库;当D读取信息时,它将与C写入的信息相同。
1. The information is written to the database; when D reads the information, it will be identical to what C has written.
2. 信息被写入一个数据库,需要一些时间才能同步到另一个数据库,因为它必须通过某种内部总线传输,以便可以从一个数据库复制到另一个数据库。如果D立即从B读取信息,它将收到旧信息;D 必须等待同步过程完成才能看到信息(或数据库)的准确副本。
2. The information is written to one database, and it takes a few moments to be synchronized to the other database because it must be transferred across some sort of internal bus so it can be copied from one database to the other. If D reads the information immediately from B, it will receive the old information; D must wait for the synchronization process to complete to see an accurate copy of the information (or the database).
3. 一旦 C 将信息写入复制 A,该信息必须通过内部总线到达网络接口,编组为某种形式的数据包,序列化到线路上(可能在排队几分钟后),然后复制第二个服务器上的线路,通过第二个服务器的内部总线,然后同步到数据库的第二个副本。
3. Once C has written the information to copy A, the information must cross an internal bus to a network interface, be marshaled into some form of data packet, serialized onto the wire (potentially after being queued for a few moments), copied off the wire at the second server, passed over the second server’s internal bus, and then synchronized to the second copy of the database.
4. 一旦 C 将信息写入复制 A,该信息必须通过内部总线到达网络接口,编组为某种形式的数据包,序列化到线路上(可能在排队一段时间后),然后复制由路由器传输到线路,在内存中进行处理和交换,在路由器中排队,序列化回到线路上,在第二个服务器上从线路上复制,通过第二个服务器的内部总线,然后同步到数据库的第二个副本。
4. Once C has written the information to copy A, the information must cross an internal bus to a network interface, be marshaled into some form of data packet, serialized onto the wire (potentially after being queued for a few moments), copied off the wire by the router, processed and switched in memory, queued in the router, serialized back onto the wire, copied off the wire at the second server, passed over the second server’s internal bus, and then synchronized to the second copy of the database.
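The four cases can be modeled as a synchronization delay that grows with each hop between the two copies. The per-hop delay values below are illustrative placeholders, not measurements; the point is only that D reads stale data for longer as the copies move logically farther apart:

```python
# Toy model: a write lands in copy A, and copy B becomes consistent only
# after a synchronization delay accumulated across the hops between them.
HOP_DELAY = {"internal_bus": 1, "wire": 2, "router": 3}  # arbitrary ticks

CASES = {
    1: [],                                    # single database
    2: ["internal_bus"],                      # two copies, one server
    3: ["internal_bus", "wire"],              # two servers, direct wire
    4: ["internal_bus", "wire", "router"],    # two servers via a router
}

def read_after_write(case, read_time):
    """What D reads at `read_time` ticks after C's write completes."""
    sync_done = sum(HOP_DELAY[h] for h in CASES[case])
    return "new value" if read_time >= sync_done else "stale value"

for case in CASES:
    print(case, read_after_write(case, read_time=1))
```

In case 1 the read is always consistent; by case 4 a read one tick after the write still returns the old information.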
It should be obvious that the farther apart the two copies of the database are logically, the longer it will take for the information in copy B to match the information in copy A after C has finished writing. This is the first third of the CAP theorem: partitionability. The database in set 1 is not partitioned; as you move left to right, the database becomes more “strongly” partitioned by adding more processes that the information must pass through before the two copies of the database can be synchronized.
Assume you must ensure that the information D retrieves is exactly what C writes. The simplest way to ensure this is to simply block D from reading the copy at B until you know the two copies are synchronized. To put this in other terms, you can block D’s access to the database. This is the second third of the CAP theorem, the “A”—accessibility. You can solve the synchronization problem caused by partitioning the database by making the database unreadable part of the time, or less accessible.
An alternate assumption might be that you do not need D to read precisely the same information as C has written; hence the read and write do not need to be consistent. This is the final third of the CAP theorem, the “C”—consistency.
Putting this all together, the CAP theorem states there are three design parameters in building a database: consistency, accessibility, and partitionability. You can choose, in some measure, two of the three.
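The accessibility-for-consistency trade described above can be sketched by blocking reads until replication completes. This is a toy model built for illustration, not any real database's API; the class name and delay are assumptions:

```python
import threading
import time

# Reads of copy B are blocked until a (simulated) synchronization from
# copy A completes -- consistency preserved by sacrificing accessibility.
class ReplicatedValue:
    def __init__(self):
        self._value = None
        self._synced = threading.Event()
        self._synced.set()  # nothing pending yet

    def write(self, value, sync_delay):
        self._synced.clear()          # copy B is now stale: block readers
        def _sync():
            time.sleep(sync_delay)    # stand-in for bus/wire/router hops
            self._value = value
            self._synced.set()        # copies consistent: unblock readers
        threading.Thread(target=_sync, daemon=True).start()

    def read(self):
        self._synced.wait()           # accessibility is sacrificed here
        return self._value

db = ReplicatedValue()
db.write("route to 2001:db8::/64", sync_delay=0.1)
print(db.read())   # blocks ~0.1s, then returns the consistent value
```

A distributed routing protocol makes the opposite choice: `read()` would return immediately, stale or not.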
How does this apply to control planes? The answer is quite simple: if you want to have a consistent view of the network, you must somehow block access to the database containing the description of the network topology during some periods of time (specifically while the network is converging). What distributed routing protocols do is to allow access all the time, and simply “live with” the inconsistencies resulting from this “always available” distributed database of topology and reachability information.
Centralized control planes, however, face a double problem. First, the database is now distributed between the actual forwarding devices and the device with the database describing the network. Second, there cannot be “only one controller”—this would create an unacceptable single point of failure. To prevent having a single point of failure, there must be at least two controllers. Those controllers must be synchronized in some way.
So centralized control planes face a number of challenges, such as
• Ensuring the actual state of the network is reflected from the devices that are connected to links and destinations into the controller
• Ensuring controllers have a consistent view of the network, or that an inconsistent view of the network does not cause systemic, large-scale failures in some way
• Ensuring the information needed to forward packets is available at individual forwarding devices fast enough that forwarding does not suffer because of the distributed nature of the control and forwarding planes
There are solutions in each of these spaces, but they often introduce as much complexity as a distributed control plane in the first place. Quite often, hybrid models are chosen to balance between the complexity of distributed control planes and the complexities of centralized control planes.
One interesting way to think about centralization and decentralization is through the subsidiarity principle. Applying subsidiarity, which arises out of the social teaching of Thomas Aquinas, might seem to go far afield of engineering, but consider the principle itself:
This tenet holds that nothing should be done by a larger and more complex organization which can be done as well by a smaller and simpler organization.2
The “root” of the subsidiarity principle is this: decisions should be made as close as possible to the information the decisions themselves depend on. Applying this principle to network engineering means thinking about where information is produced and placing any decision maker (generally a protocol, process, etc.) as close to the source of information as possible. Looking at this from a CAP theorem perspective, putting the decision maker close to the source of the information on which the decision maker depends reduces the amount of time between the information being available and the decision being made.
What does this suggest in the network engineering world? Policy comes primarily out of business decisions, and business decisions should be close to the business, not the topology. Hence, policy, or at least some element of policy, is often best done when centralized. Topology and reachability, however, are grounded in what should be the only source of truth about the state of the network, the network itself. Therefore, it makes sense that decisions related to the topology and reachability, from detection to reaction, should be kept close to the network itself; hence, topology and reachability decisions should trend toward being decentralized.
There are no absolutes in the world of network engineering; if you have not found the tradeoff, you have not looked hard enough. This is true of choosing when and where to centralize, and when and where to decentralize. These choices are presented as either/or absolute choices far too often. The reality is that different problems often require different solutions.
Centralization and distribution, in terms of solving the many different problems control planes must resolve in the real world, provide network and protocol designers with a number of tradeoffs. A range of possible solutions seems obvious when considering these two different ways of providing reachability and topology information; for instance:
• Centralize reachability and topology discovery in a set of distributed controllers, replacing the distributed control plane with a centralized one. This model does not truly assume a “centralized” control plane, so much as it does removing the processing of information discovered about the network out of the individual switching devices. Distribution of the information across a number of devices is still key in creating a resilient design.
• Centralize some part of the function of the control plane, normally the policy, and distribute the remaining parts. PCEP, I2RS, and many other “overlay” control planes take this route.
• Decentralize all of the control plane components, including reachability, topology, and pushing policy. There are actually very few large-scale networks fully decentralized in this way; at least some policy is normally centralized.
There is, in the end, no simple answer to the problems control planes pose in the real world. Between the three-way tradeoff implied by complexity theory (state, optimization, and surface) and the CAP theorem (consistency, accessibility, and partitioning), designers can make a wide range of choices when building networks and protocols.
Choose wisely.
Atlas, Alia, David Ward, and Thomas Nadeau. Problem Statement for the Interface to the Routing System. Request for Comments 7920. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7920.txt.
Bjorklund, Martin. The YANG 1.1 Data Modeling Language. Request for Comments 7950. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7950.txt.
Bosnich, David A. “The Principle of Subsidiarity.” Religion & Liberty 4, no. 4 (July 2010). https://acton.org/pub/religion-liberty/volume-6-number-4/principle-subsidiarity.
Clarke, Joe, Gonzalo Salgueiro, and Carlos Pignataro. Interface to the Routing System (I2RS) Traceability: Framework and Information Model. Request for Comments 7922. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7922.txt.
Doria, Avri, Ligang Dong, Weiming Wang, Hormuzd M. Khosravi, Jamal Hadi Salim, and Ram Gopal. Forwarding and Control Element Separation (ForCES) Protocol Specification. Request for Comments 5810. RFC Editor, 2010. https://rfc-editor.org/rfc/rfc5810.txt.
Hares, Susan, and Mach Chen. “Summary of I2RS Use Case Requirements.” Internet-Draft. Internet Engineering Task Force, November 2016. https://tools.ietf.org/html/draft-ietf-i2rs-usecase-reqs-summary-03.
Hares, Susan, Qin Wu, and Russ White. “Filter-Based Packet Forwarding ECA Policy.” Internet-Draft. Internet Engineering Task Force, October 2016. https://tools.ietf.org/html/draft-ietf-i2rs-pkt-eca-data-model-02.
Medved, Jan, Nitin Bahadur, Hariharan Ananthakrishnan, Xufeng Liu, Robert Varga, and Alexander Clemm. “A Data Model for Network Topologies.” Internet-Draft. Internet Engineering Task Force, March 2017. https://tools.ietf.org/html/draft-ietf-i2rs-yang-network-topo-12.
Medved, Jan, Nitin Bahadur, and Sriganesh Kini. “Routing Information Base Info Model.” Internet-Draft. Internet Engineering Task Force, December 2016. https://tools.ietf.org/html/draft-ietf-i2rs-rib-info-model-10.
Medved, Jan, Robert Varga, Hariharan Ananthakrishnan, Nitin Bahadur, Xufeng Liu, and Alexander Clemm. “A YANG Data Model for Layer 3 Topologies.” Internet-Draft. Internet Engineering Task Force, January 2017. https://tools.ietf.org/html/draft-ietf-i2rs-yang-l3-topology-08.
Nadeau, Thomas, Alia Atlas, Joel M. Halpern, Susan Hares, and David Ward. An Architecture for the Interface to the Routing System. Request for Comments 7921. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7921.txt.
Prieto, Alberto Gonzalez, Eric Voit, Ambika Tripathy, Einar Nilsen-Nygaard, Balazs Lengyel, Andy Bierman, and Alexander Clemm. “Subscribing to YANG Data-store Push Updates.” Internet-Draft. Internet Engineering Task Force, October 2016. https://tools.ietf.org/html/draft-ietf-netconf-yang-push-04.
Vissicchio, Stefano, Laurent Vanbever, and Jennifer Rexford. “Sweet Little Lies: Fake Topologies for Flexible Routing.” In ACM HotNets. Los Angeles, California, 2014.
Wang, Lixing, Hariharan Ananthakrishnan, Mach Chen, Sriganesh Kini, and Nitin Bahadur. “A YANG Data Model for Routing Information Base (RIB).” Internet-Draft. Internet Engineering Task Force, January 2017. https://tools.ietf.org/html/draft-ietf-i2rs-rib-data-model-07.
1. Research Microsoft’s SWAN architecture. How would you classify this architecture? What parts of the control plane are centralized, what parts are distributed, what interface is used, and what is the southbound protocol?
2. Research Google’s FirePath architecture. How would you classify this architecture? What parts of the control plane are centralized, what parts are distributed, what interface is used, and what is the southbound protocol?
3. Research OpenFabric. How would you classify this architecture? What parts of the control plane are centralized, what parts are distributed, what interface is used, and what is the southbound protocol?
4. Research RESTful interfaces. What is the difference between a RESTful and non-RESTful interface?
5. Among the four possible interfaces, what interface did the ForCES protocol interact with?
6. Research OpenFlow hybrid mode. Why did the developers of the OpenFlow protocol abandon this idea? How would this mode have changed the classification of the OpenFlow protocol?
7. One of the problems facing operators who use BGP as a southbound interface is the lack of a full view of the network topology. How does BGP-LS (Link State) solve this problem?
8. What are the state, surface, and optimization tradeoffs in fibbing?
1. Atlas, Ward, and Nadeau, Problem Statement for the Interface to the Routing System.
2. Bosnich, “The Principle of Subsidiarity.”
The intentional modification or shaping of traffic flows across a network is not the only kind of policy that network engineers must interact with. Information hiding, while not often considered a form of policy, relates to the larger goals, or policies, of building scalable, repeatable networks. These policies have consequences in terms of traffic flow, although these consequences are often unintentional rather than intentional—which means they are often ignored. This chapter and the next, Chapter 20, “Examples of Information Hiding,” are dedicated to considering this one problem, the solution space, and some widely used solution implementations. The first section in this chapter will examine the problem space, the second various kinds of solutions that can be used to counter the problem, and the third section will consider information hiding in the context of network complexity.
Control planes are designed to learn about and carry as much information about the network topology and reachability as possible. Why would network engineers want to limit the scope of this state, once the processing and memory have been spent to learn it? There are several answers, including
• To reduce resource utilization in devices participating in the control plane, generally just to save costs
• To prevent a failure in one part of a network from impacting some other part of the network; in other words, to break up the network into failure domains
• To prevent leaking information about the topology of the network, and reachability to destinations attached to the network, to attackers; in other words, to reduce the network’s attack surface
• To prevent positive feedback loops that can cause a complete network failure
The problems in the preceding list can be divided into two categories: reducing the scope of control plane information and reducing the speed at which control plane information is allowed to change. These will be considered in the two following sections.
Figure 19-1 illustrates the scope of control plane state.
There are two kinds of state carried by the control plane: topology and reachability. These two kinds of control plane state can have different scopes in a network. For instance:
• If D has knowledge of 2001:db8:3e8:100::/64, then the scope of this reachability information is A, B, C, and D—the entire network.
• If C has knowledge of 2001:db8:3e8:100::/64, and D does not, then the scope of this reachability information is A, B, and C.
• If D knows about the link connecting A and B, or that A and B are adjacent, the scope of this topology information is A, B, C, and D—the entire network.
• If D does not know about the link connecting A and B, or that A and B are adjacent, the scope of this topology information is A, B, and C.
Another way to look at this is to ask: if a link or reachability to a specific destination fails, which devices must participate in convergence? Any device that does not participate in convergence, perhaps by sending an update, recalculating the set of loop-free paths through the network, or switching to an alternate path, is not part of the failure domain. Any device that does need to send an update, recalculate the set of loop-free paths, or switch to an alternate path is part of the failure domain. The scope of a failure, then, determines the scope of the failure domain. In Figure 19-1:
• If D has knowledge of 2001:db8:3e8:100::/64, then D must recalculate its set of reachable destinations if 100::/64 is disconnected from A; hence D is part of the failure domain for this destination.
• If D does not have knowledge of 2001:db8:3e8:100::/64, then D does not change its local forwarding information when 100::/64 is disconnected from A; hence D is not part of the failure domain for this destination.
• If D has knowledge of the link between A and B, then D needs to recalculate the set of loop-free paths through the network if the link fails (along with any reachability information passing through the link); hence D is part of the failure domain for this specific link.
• If D does not have knowledge of the link between A and B, then D does not need to recalculate anything when the link fails; hence D is not part of the failure domain.
This definition means failure domains must be determined for each piece of reachability and topology information. While protocols and network designs will block reachability and/or topology at common points in a network, there are cases in which
• Topology information is blocked, but not reachability information.
• Some reachability information is blocked, but not all.
• Some reachability or topology information leaks, causing a leaky abstraction.
The scope of control plane information within a network is important because it has a very large impact on the speed at which the control plane converges. Each additional device required to recalculate because of a change in topology or reachability represents some amount of time the network will remain unconverged, and hence either some destinations will be unnecessarily unreachable, or packets will be looped across some set of links in the network because some routers have a different view of the network topology than others. Looping, in particular, is a problem, because loops quite often have the potential to become positive feedback loops, which can cause the control plane to fail to converge permanently.
Positive feedback loops are a bit harder to imagine than the scope of control plane information; Figure 19-2 illustrates.
In Figure 19-2, there are four devices:
• Device A, which adds whatever it receives from the signal input and what it receives from B
• Device B, which can either increase or decrease the size or frequency of the signal it receives from C
• Device C, which passes the signal along unchanged to D, and also samples the signal, sending the sample to B
• Device D, which measures the signal
To create a simple feedback loop, assume C samples some fraction of the signal passing through it, passing this sample to B. Device B, in turn, amplifies the sample by some factor, and passes this amplified signal back to A. Figure 19-3 shows the result.
The case shown in Figure 19-3 is a positive feedback loop; B amplifies the sample it receives, making the signal just a bit larger. The result, at D, is a signal with constantly increasing amplitude. When will this feedback loop stop? When some limiting factor is hit. For instance, A may reach some limit where it cannot continue to add the two signals, or perhaps C reaches some input signal limit and fails, releasing its magic smoke (as all electronics will do if driven with too much input power). It is also possible to set up a negative feedback loop, where B removes a slight bit of power each cycle; in the case of a sine wave (as shown here), this would require B to invert the sample it receives from C. Finally, it is possible to configure each component in this circuit to neither increase nor decrease the final output at D. In this case, B would be somehow tuned to compensate for any inefficiency in the wiring, or in A or C’s operation, by injecting just enough feedback to A to keep the signal at the same power at D.
Figure 19-4 recasts the amplitude of the output signal as the frequency of an event to illustrate why such loops matter for control planes.
In Figure 19-4, B (as shown previously in Figure 19-2) is programmed to send a single event for every pair of events it receives. In the original signal input, there are six event signals, so B adds three more into the feedback path toward A. In the second round, shown in the center column, the original six events from the input signal are added to the three from B, resulting in nine event signals. Based on these nine event signals at the output of C, B will generate four event signals and feed them back to A. The result is that the output of A now has ten event signals. This increase in the number of signals will continue until the entire time space is saturated with event signals.
Physical and logical loops can cause links to become saturated, devices to run out of processing power or memory, or a number of other conditions that will eventually cause a network failure. Figure 19-5 is used to provide an example.
Assume that each router in Figure 19-5 is capable of processing ten changes to the network per second—either a route or topology change, for instance—and there are five routes total in the routing table. Because of the speeds of the interfaces (or for some other reason), the order in which updates are transmitted through the network is always [D,A,C,B]; updates from D through [A,C] always arrive at B before updates through [D,A] directly.
The 2001:db8:3e8:100::/64 link begins to flap three times per second. It seems the network should converge despite this flap rate; it is, after all, well under the ten events per second any single device can support. To understand the impact of the feedback loop, however, it is important to trace the entire process of convergence:
• Each time the 100::/64 link fails or comes up, D sends an update to A; this is three failures and three recoveries, for a total of six events per second.
• For each of these events, D will send an update to A.
• For each of these events, A will send an update to B and C.
• B will also send an update toward C for each update it receives; this effectively doubles the rate of events at C to 12 per second.
Because C receives 12 events per second during the first second, it will fail, in turn taking down its relationships with A and B. When it comes back up, it will attempt to establish new adjacencies with each of the connected routers, which means it will send its entire database, containing five routes, to A and B. Given the 100::/64 link is still flapping at the same rate, this will drive B above its threshold, causing B to crash. It is possible as well (depending on the timing) that A could crash.
Once A crashes, the chain of crashes through resource exhaustion will continue—if the timing is correct, or the crashes form their own self-supporting feedback loop, even if the original flapping link is repaired. Although feedback loops of this kind are not tagged as the root cause of the failure (the flapping link would be considered the root cause of the failure in this example), they are often what turns a single event into a complete failure of the control plane to converge.
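The arithmetic of the cascade's first second can be checked with a short sketch; the ten-event capacity, three-flap rate, and update paths are taken directly from the example above:

```python
CAPACITY = 10  # events per second a router can process

def events_seen(flaps_per_sec):
    """Events per second each router must process during the first second."""
    link_events = flaps_per_sec * 2          # each flap = one down + one up
    return {
        "D": link_events,                    # directly attached to the link
        "A": link_events,                    # one update from D per event
        "B": link_events,                    # updates relayed by A
        "C": link_events * 2,                # updates from both A and B
    }

def crashed(seen):
    return sorted(r for r, n in seen.items() if n > CAPACITY)

seen = events_seen(3)
print(seen["C"])        # 12 -- above C's capacity
print(crashed(seen))    # ['C']
```

A single flap per second would leave every router comfortably under capacity; the doubling at C is what tips the first domino.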
A number of solutions have been developed over the years to limit control plane state, including summarization, aggregation, filtering, layering, caching, and back-off timers. All of these solutions fall into one of two different ways to limit control plane state—reducing the scope or the speed of control plane information. Each of these, in turn, solves a specific problem, such as
• Reducing the scope of control plane information improves security by controlling the set of devices through which a view of the network can be obtained.
• Reducing the scope of control plane information improves convergence by controlling the set of devices that must recalculate loop-free paths through the network because of any individual change.
• Reducing the scope of control plane information reduces the chance of positive feedback loops by preventing state from “looping back” through the control plane.
• Reducing the scope of control plane information reduces the chance of resource exhaustion in any particular device (and potentially lowers the cost of any particular device) by reducing the size of any tables held in memory and across which the set of loop-free paths must be calculated.
• Reducing the speed of control plane information traveling through the network, or the velocity of state, reduces the chance of positive feedback loops forming and reduces the chance of resource exhaustion in any individual device.
The following sections consider several widely implemented and deployed techniques used to control the scope and velocity of state.
Topological information can be summarized by making destinations that are physically (or virtually) connected several hops away appear to be directly attached to a local node, and then removing the information about the links and nodes from any routing information carried in the control plane beyond the point of summarization. Figure 19-6 illustrates this concept from the perspective of F, with E summarizing.
Before the topology is summarized (the upper network), F might (depending on the protocol) know A is connected to B, B is connected to C and D, and C and D are connected to E. If E begins to summarize the topology information (shown in the lower network), each of these other nodes appears, from F’s perspective, to be directly connected to E. The physical topology does not change, of course, but F’s view of the topology does change.
Summarization is a form of abstraction over the network topology; the set of reachable destinations is abstracted from the network and connected so that loop-free paths are preserved, but not detailed topology information. The way this is normally done is to remove actual link information while preserving the metric information associated with each destination, as the metric information alone can be used to calculate loop-free paths.
Distance vector protocols essentially summarize topology information at every hop, as they transmit each destination with a metric between devices. In Bellman-Ford, the local device examines its local view of the network to calculate the set of loop-free paths through the network. In Garcia-Luna’s Diffusing Update Algorithm (DUAL), the device keeps (in effect) one hop of topology information, the cost to each destination as seen from each of its neighbors, and uses this information to calculate alternate loop-free paths to each destination. Link state protocols carry full topology information, including links and metrics, within a single flooding domain.
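DUAL's feasibility check can be sketched in a few lines (the function and variable names here are illustrative, not from any implementation): a neighbor can be used as a loop-free alternate only when the distance it reports is strictly less than the local feasible distance, since such a neighbor cannot be routing through the local device.

```python
def feasible_successors(neighbors, feasible_distance):
    """Return loop-free alternates and their total costs.

    neighbors maps a neighbor name to (reported_distance, link_cost);
    a neighbor whose reported distance is below the local feasible
    distance cannot be using the local router to reach the destination,
    so the path through it is guaranteed loop-free."""
    return {
        name: reported + cost
        for name, (reported, cost) in neighbors.items()
        if reported < feasible_distance  # DUAL's feasibility condition
    }
```

For example, if B reports a distance of 10 across a link of cost 5 while the local feasible distance is 15, B is a feasible successor with a total cost of 15; a neighbor reporting 20 is rejected even if its link is cheap.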
If you take a trip to a distant city through a series of flights, you will need
• Directions from your home to the local airport
• Directions within the local airport to the correct gate to board the aircraft
• Directions from gate to gate within the airport where each flight connection is made
• Directions from the gate to the place where you pick up a rental car, or to a taxi, or to some form of public transportation
• Directions to the hotel where you will be staying
• Directions from the hotel to the site of the meeting or conference you will be attending
What would happen if you called your destination hotel and asked for full directions to its location from yours? Assuming the hotel staff even know how you are traveling, the directions would easily overwhelm you. Maybe they would look something like this:
1. Walk out your front door and get into your car.
2. Turn left out of your driveway, go to the first stop sign, turn left.
3. Proceed three blocks and turn right onto the entrance ramp onto the highway.
4. Merge into traffic and stay on this road for 4.1 miles.
5. …
6. When you disembark from the plane, turn left on exiting the gate.
7. Travel 400 yards to the internal airport transportation station.
8. Ascend the steps or escalator to the second level, turn left, and board the first train arriving there.
9. On the third stop, exit the train, turn left, and proceed down the steps or escalator to the first floor.
10. …
You can see how such a set of directions might be overwhelming in their scope. In fact, they would be so overwhelming as to be confusing.
Note
Why are down escalators called escalators? Since they go down, shouldn’t they be called descalators?
The way travelers really navigate is in stages, or segments. A broad set of directions is given (board flight 123, which will take you to Chicago; then flight 456, which will take you to San Jose; rent a car; and drive to the hotel). At each of these steps, you assume there will be directions available locally to take you between any two points. For instance, you assume there will be signs on the local highway, or some software or map you can consult to provide you with directions from your home to the local airport, and then there will be signs within the airport where you are connecting between flights to guide you between the gates, etc.
This process of taking a trip in stages is, in reality, a form of abstraction. You know, when you travel, that information will become available as you proceed through the trip, and hence you do not need it right now. What you need is enough information to get you into a general area and then access to more detailed information when you get there.
This is precisely how aggregation in network protocols works. Aggregation removes more specific information about a particular destination as topological distance is covered in the network. Figure 19-7 illustrates.
In Figure 19-7, there are three hosts connected to a single shared link (broadcast domain) attached to an interface on A. Each of these hosts has its own physical Media Access Control (MAC) address, which is related to an Internet Protocol (IP) address, which has been assigned either manually or through the Dynamic Host Configuration Protocol (DHCP). These addresses all fall within a single /64 range of addresses. A aggregates these host addresses into a single advertisement, traditionally considered the address of the “wire” in IP networks: 2001:db8:3e8:100::/64.
Two other routers, B and C, are advertising two other /64s; the three /64s advertised by A, B, and C fall within the same /60 address range. Router D is configured to aggregate these three /64s to the /60. E, in turn, advertises a default route (::/0) to F, which means “any IP address you do not know about, you can reach through me.” This is an aggregate sitting “above” 2001:db8:3e8:100::/60. Some useful terminology:
• Supernet or aggregate: An address that covers, or represents, a set of longer prefix, or more specific, destinations
• Subnet: An address that is covered, or represented by, a shorter prefix, or less specific, destination in the routing table
Subnets and aggregates look identical in the routing table of any individual device. The only way you can see if a particular route is either a supernet or subnet is if the longer and shorter routes both exist in the routing table of the aggregating device at the same time. Without the subnet, you cannot tell whether a route is an aggregate or not.
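Python's standard ipaddress module can be used to check these covering relationships for the prefixes in this example (a quick sketch, not part of any routing implementation):

```python
import ipaddress

# The three /64s advertised by A, B, and C in the example
subnets = [ipaddress.ip_network(p) for p in (
    "2001:db8:3e8:100::/64",
    "2001:db8:3e8:101::/64",
    "2001:db8:3e8:102::/64",
)]

aggregate = ipaddress.ip_network("2001:db8:3e8:100::/60")  # configured on D
default = ipaddress.ip_network("::/0")                     # advertised by E

# Every /64 is a subnet of the /60, and the /60 is itself a subnet
# of the default route "sitting above" it
assert all(net.subnet_of(aggregate) for net in subnets)
assert aggregate.subnet_of(default)
```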
A, in advertising 2001:db8:3e8:100::/64, does not remove any reachability from the network; rather it adds unreachable destinations that appear to be reachable to the control plane. Router A is advertising reachability to a large number of hosts, such as 2001:db8:3e8:100::4, even though this host doesn’t exist. In the same way, D is advertising unreachable address space into the network by advertising 2001:db8:3e8:100::/60, and E is advertising unreachable address space into the network by advertising ::/0.
Packets transmitted to a nonexistent host are normally just dropped by the first device with specific enough routing information to know the host doesn’t exist. For instance:
• If a packet is forwarded by F toward E with a destination address of 2001:db8:3e8:110::1, E can drop this packet, as this destination does not fall within any of the available destinations in E’s routing table.
• If a packet is forwarded by F toward E with a destination address of 2001:db8:3e8:103::1, D can drop the packet, as this destination does not fall within any of the available destinations in D’s routing table.
• If a packet is forwarded by F toward E with a destination address of 2001:db8:3e8:100::100, A would need to drop the packet, as this destination is not in the local Address Resolution Protocol (ARP) cache at A’s connection to 2001:db8:3e8:100::/64.
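The first two cases come down to a longest-prefix match failing at different routers. A minimal sketch (the tables below are assumptions read off the figure; E's table is shown without the default route it originates):

```python
import ipaddress

def lookup(table, destination):
    """Longest-prefix match: return the most specific covering prefix, or None (drop)."""
    addr = ipaddress.ip_address(destination)
    matches = [prefix for prefix in table if addr in prefix]
    return max(matches, key=lambda p: p.prefixlen) if matches else None

# D holds the three component /64s; E holds only the /60 aggregate learned from D
d_table = [ipaddress.ip_network(p) for p in (
    "2001:db8:3e8:100::/64", "2001:db8:3e8:101::/64", "2001:db8:3e8:102::/64")]
e_table = [ipaddress.ip_network("2001:db8:3e8:100::/60")]

assert lookup(e_table, "2001:db8:3e8:110::1") is None  # outside the /60: dropped at E
assert lookup(d_table, "2001:db8:3e8:103::1") is None  # inside the /60, no /64: dropped at D
```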
There is another place where aggregation can be configured in a network: between the routing table (Routing Information Base, or RIB) and the forwarding table (Forwarding Information Base, or FIB), within an individual network device. This type of aggregation is fairly unusual; it is primarily used in situations where a device’s forwarding table is restricted to a particular size because of memory limitations.
Filtering reachability information, unlike aggregation, does remove reachability information from the control plane; hence filtering is normally used as an aid or part of a layered defense for network security. Figure 19-8 is used to illustrate.
In Figure 19-8, A should be able to reach E within the organization (to the right of the organizational boundary line) and no destinations outside the organization. Host A definitely should not be able to reach G, for instance, or any of the transit links or routers within the organization’s network. There are several ways to accomplish this, of course. The network administrator could place a stateful packet filter at the edge of the network to block traffic that is not part of a session originating from inside the network, or the network administrator could configure a packet filter to block A from accessing any destination other than E. While these are, of course, good ideas, it is often best to combine such filters with some control plane filter to prevent any routers in the network that A is attached to (within the cloud) from learning about these destinations. To accomplish this, the network administrator can place a filter at B blocking the advertisement of any reachable destination within the network other than the subnet that E is attached to.
At D, all routes are also filtered toward F—except the default route. While this is configured as a route filter on D, it acts like route aggregation; the default still allows G to reach E, even though F does not have a specific route, by following the default route. It is important to differentiate between the two cases: a route filter being used like aggregation and a route filter being used to prevent or block reachability to or from a particular device (or set of devices).
In Chapter 9, “Network Virtualization,” the case for building virtual topologies was laid out from the perspective of the data plane: primarily to provide traffic separation, reachability separation, and to provide “over the top” network services, particularly encryption and tunneled protocol support. There is an entirely separate case to be made for layering control planes, either with virtualized topologies, or without. Consider the security example set out previously in Figure 19-8; another way to solve the same problem might be to provision an overlay network, as shown in Figure 19-9.
In Figure 19-9, A needs to access H and K, but not M; N needs to access all three. Router B is a smaller device, perhaps a small home office router, which can support just a handful of routes. It is possible, of course, to filter routing information at C such that B has just the one or two routes it needs, but this may not be scalable from a network management perspective. Nor does this provide traffic separation, which is a requirement in many places where overlay networks are used. Meeting any traffic separation requirements would necessitate building packet filters at every device along the path, adding further to the network management load.
A better option, in many cases, is to create a virtual overlay network including just the devices that need to communicate. In this case, the dashed gray lines represent the virtual overlay network created to fulfill the requirements given. From an information hiding perspective, what is important to note is the following:
• B does not need to know about D or G, the links connecting them, nor the 2001:db8:3e8:102::/64 subnet; information about these topology elements and reachable destinations is hidden from the control plane at B by building a tunnel, or virtual topology, with one end at B and the other ends at E and F.
• The second control plane can run as a different process on C, E, and F; this second control plane also does not need to know about these topology elements or reachable destinations.
Some information about topology and reachability, then, is hidden entirely from B, and from some processes on C, E, and F, without reducing the required reachability. To connect this back to the concept of failure domains, routers that do not know about specific topology elements and/or reachable destinations do not need to recalculate the set of loop-free paths through the network when those (hidden) elements change. Because of this, B can be said to be in a different failure domain than D and G. Virtualization, then, can often be treated as another form of information hiding.
Caching begins with a simple observation: not all forwarding information is used all the time. Rather, particular flows pass along particular paths in a network, and particular pairs of devices (typically) only communicate for short periods of time. Storing forwarding information for short-lived flows, and in devices far off the path any particular flow might use, is a waste of resources. Figure 19-10 is used to illustrate.
In Figure 19-10, the path from A to 2001:db8:3e8:100::/64 does not pass through C, E, or F; if A is the only device that ever originates paths toward this destination, it is a waste of memory and processing power for C, E, and F to calculate shortest paths to the 100::/64 destination. But how would E know no host attached to 101::/64 is going to send traffic to some device connected to 100::/64? There is no way, from a control plane perspective, to know this.
Instead, E must rely on traffic as it passes through the network. For instance, E could calculate a route toward 100::/64 when some packet is transmitted from a locally attached host toward some destination on the 100::/64 subnet. This is a reactive control plane. Caching is not restricted to reactive control planes, however. It is possible for E to calculate a loop-free route to 100::/64, but to not install this information into its local FIB. This is another form of FIB compression, which can be used when the size of the RIB is not limited, but the size of the FIB is (for instance, when there is a limited hardware forwarding table). FIB compression was once quite common in network devices but has generally fallen out of favor as the cost of memory has decreased and other techniques to store more forwarding information in smaller amounts of memory have been developed and deployed.
Note
There are also bad memories in the culture of network engineering around RIB to FIB caching schemes; in the late 1990s, many provider networks failed due to these schemes, so many network engineers avoid such schemes—and often rightly so. There are many interesting and unpredictable failure modes in RIB to FIB caching schemes, beyond those found in “normal” caching schemes.
The key question in any caching scheme is: how long should the cached information be held? There are at least two answers to this question:
• Remove a cache entry some specific time after it has been installed, or some specific time after its last use to forward a packet; this is timer based.
• Remove the oldest or most specific cache entries when the cache reaches some percentage of its capacity; this is capacity based.
Normally these are combined, with the first being the “normal” process for removing stale cache information, and the second used as a “safety valve” to prevent the cache from overflowing. Caches normally rely on the number of forwarding table entries in use being some small percentage of the reachable destinations. Generally, the rule of thumb is somewhere around 80/20—80% of the traffic will be directed at 20% of the destinations, or, in other situations, about 20% of the total reachable destinations will need to be stored at any given time.
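A toy sketch of the two policies combined, with the capacity valve triggering at 56 entries and trimming back to 50, as in the example that follows; the class, parameter names, and numbers are all illustrative:

```python
import time

class ForwardingCache:
    """Illustrative cache combining timer-based removal with a
    capacity-based 'safety valve' (all names and numbers hypothetical)."""

    def __init__(self, ttl=60.0, trigger=56, target=50, clock=time.monotonic):
        self.ttl = ttl            # seconds an unused entry may linger
        self.trigger = trigger    # entry count that trips the safety valve
        self.target = target      # entry count to trim back down to
        self.clock = clock
        self.entries = {}         # destination -> time last used to forward

    def use(self, destination):
        """Install or refresh an entry when a packet is forwarded."""
        self.entries[destination] = self.clock()
        if len(self.entries) > self.trigger:          # capacity based
            oldest_first = sorted(self.entries, key=self.entries.get)
            for dest in oldest_first[: len(self.entries) - self.target]:
                del self.entries[dest]

    def expire(self):
        """Timer-based removal of stale entries."""
        now = self.clock()
        for dest in [d for d, t in self.entries.items() if now - t > self.ttl]:
            del self.entries[dest]
```

Injecting the clock makes the behavior easy to test without waiting on real timers; a production implementation would also need to handle the forwarding miss that follows an eviction.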
There are a number of problems designers face when caching forwarding information in this way. Figure 19-11 is used to illustrate one interesting failure mode.
In Figure 19-11, E has 100 hosts attached; at the same time, C and D can support 70 entries in their forwarding table and will start removing items from cache when their forwarding table is 80% full (so when the cache reaches 56 entries, the caching algorithm begins removing the oldest entries to bring the cache under some number of total entries, say 50 for the purposes of this example). Assume caching is taking place at the individual destination IP address level, rather than at the subnet level (the reason for this will be explained in a following example). The situation that caching solutions normally assume is that A will communicate with a limited number of the 100 possible destinations at once. If A builds sessions with 20 of these destination devices for one minute, then another 20 the next minute, and so on, the cache can be “tuned” to carry information about any particular reachable destination for just a few seconds after its last use.
The worst possible case, from a caching perspective, is that A attempts to communicate with all 100 reachable hosts at once, or the cache timers are set long enough to cause every one of these destinations to remain in the cache at all times. Two problems are going to develop in this case. First, the cache at B is going to overflow. When B receives a packet that triggers caching of the 57th destination, it will begin removing older cache entries in order to protect the cache from failing entirely. The flow dependent on the removed cache entries will, of course, continue sending packets (or perhaps reset, and begin sending packets again), again causing the cache to reach the 57th entry, and hence the oldest entries to be removed again. This is a straightforward problem, easily detected, even if it is not easily mitigated.
Second, the caches at C and D are likely to develop problems. It is possible to build a stable system if B splits the load perfectly between C and D. However, this is rarely going to happen in real life. Instead, what is likely to happen is, at best, a 60/40 split; so traffic sent by B toward 40 of the destinations is sent to C, while traffic toward the other 60 destinations is sent toward D. The result is that the cache on D overflows (there would need to be 60 cache entries, which is more than the 56 allowed by the caching algorithm), causing D to start removing cache entries. The removal of this caching information will cause sessions to reset, as well.
The cache churn at B, C, and D can easily develop into a positive feedback loop, where dropped packets and sessions cause a refactoring of where traffic flows in the network, in turn causing different caches to overflow, in turn (again) causing dropped packets and session resets. There are few ways to resolve this sort of problem other than the obvious ones: increase the cache size, or reduce the number of concurrent flows through the network.
One apparently obvious answer—caching to the subnet level, rather than individual hosts—will not work. Figure 19-12 is used to explain why this will not work.
Figure 19-12 shows two networks: one (the upper) labeled before and the other (the lower) labeled after. Assume B, C, D, and E cache to the subnet of the destination, rather than the individual host information. What happens in this network is
• A sends a packet to 2001:db8:3e8:101::1.
• B receives this packet and discovers (through some mechanism—it does not matter what this mechanism is) that the destination is reachable through C and D.
• B determines (perhaps based on load sharing) that the traffic should travel through C; it builds a cache entry toward 2001:db8:3e8:100::/60 through C in its local forwarding table.
• A now sends a packet to 2001:db8:3e8:100::1.
• B forwards this traffic along the path toward 100::/60, so the traffic is sent to C, then forwarded to E, where it is dropped.
Why does E drop this traffic? The packet destined to 100::1 “lives” in two different network address spaces: the 100::/60 and the 100::/64. E knows about the 100::/60 address space, so it should know about every reachable destination in this space. Because E believes it knows about every destination in this address space, there is no reason for E to ask any of its neighbors about 100::1; it should already know about this specific destination. This destination, however, is connected to D, so there is no way for E to have 100::1 in its local forwarding table. In effect, E believes that 100::1, as an individual host, does not exist, so it will drop any traffic destined to this address.
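The failure can be reproduced in a few lines (a sketch; the cache and table contents are assumptions drawn from the figure):

```python
import ipaddress

# B's cache after the first packet: the whole /60 was installed via C
b_cache = {ipaddress.ip_network("2001:db8:3e8:100::/60"): "C"}

# E believes it "owns" the /60, but only has a specific route for the
# /64 actually behind it; 100::/64 is attached to D instead
e_owned = ipaddress.ip_network("2001:db8:3e8:100::/60")
e_specifics = [ipaddress.ip_network("2001:db8:3e8:101::/64")]

dest = ipaddress.ip_address("2001:db8:3e8:100::1")   # actually attached to D

# B's subnet-level cache entry covers the destination, so B forwards toward C...
next_hop = next(hop for net, hop in b_cache.items() if dest in net)
assert next_hop == "C"

# ...but at E the destination falls inside the space E believes it knows
# completely, with no specific route covering it, so E drops the packet
# rather than asking a neighbor
assert dest in e_owned and not any(dest in net for net in e_specifics)
```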
Because of this, A has no effective way to reach any device attached to 100::/64 network; it might be that when (or if) the cache entry times out at B, the next packet will happen to be for a destination within the 100::/64 network, causing the correct set of cache entries to be built at B. Whether or not this is likely to happen, it is never a good thing for control planes to have possible states, such as this one, where reachability is variable or unpredictable.
There are a number of ways this problem could be fixed, none of which appear to be deployable in the real world. For instance, you could dictate that every prefix in the network must have the same prefix length, but this would rule out aggregation, which is problematic.
There is another reason to build cache entries at the per device (or per address) level—to improve load sharing. Consider the example shown in Figure 19-11; if B built its cache at the per subnet level, then B would choose one path, either through C or through D, to send all the traffic in the network. The other path would remain unused (at least until B’s cache entry timed out, at which point the used and unused paths might switch). Caching at the subnet level can cause a large set of network resources to go unused; generally this is considered a result to be avoided.
Everyone in the modern world should know the value of slowing down sometimes—it can reduce information overload. It is no different for a control plane; slowing down the pace at which information is presented to a device does not really reduce the processing and memory requirements so much as spread them out over time. Another point in favor of slowing down state velocity is that it can allow multiple state changes to be “gathered,” or “bunched,” into a single processing cycle. Figure 19-13 illustrates these concepts.
In Figure 19-13, timeline 1 illustrates the actual order in which the links between F and another router fail; [A,F] and [B,F] fail relatively close to one another, and the remaining links fail a bit farther apart (or spread out in time). In timeline 2, F waits to advertise the control plane state change for a fixed amount of time. Because of this delay between the event occurring and reporting the event, the failures of the [A,F] and [B,F] links are reported at the same time, or in the same update. This allows G to process both events at the same time, which (should) require less processor and memory resources.
最后,在时间线 3 中,显示了指数退避计时器。本质上,事件第一次发生时,会设置一个计时器,并在计时器到期后报告该事件。在时间线 3 中,此计时器设置为 0 秒,因此立即报告事件(指数退避的常见配置)。一旦报告了事件,就会设置一个单独的计时器,该计时器必须在报告下一个事件之前到期(或唤醒)。此后发生的每个事件都会以指数方式增加该计时器,导致事件报告的时间不断增加。
Finally, in timeline 3, an exponential backoff timer is shown. Essentially, the first time an event occurs, a timer is set, and the event is reported after the timer has expired. In timeline 3, this timer is set to 0 seconds, so the event is reported immediately (a common configuration for exponential backoffs). Once the event has been reported, a separate timer is set that must expire (or wake up) before the next event can be reported. Each event occurring after this increases this timer exponentially, causing the reporting of events to be spread out over ever-increasing amounts of time.
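The behavior of timelines 2 and 3 can be modeled with a short simulation. This is a sketch with invented timestamps and timer values, not any particular router implementation: events arriving while the hold-down timer runs are bunched into the next update, and each update grows the timer exponentially up to a cap.

```python
def report_times(event_times, base=1.0, mult=2.0, max_hold=8.0):
    """Simulate exponential backoff reporting. Returns a list of
    (send_time, events_in_update) tuples. The first event is reported
    immediately (a 0-second initial timer, as in timeline 3); after each
    update the hold-down timer grows by `mult`, capped at `max_hold`."""
    sends, hold, next_ok, i = [], 0.0, event_times[0], 0
    while i < len(event_times):
        # Wait for both an event and the expiry of the hold-down timer,
        # then bunch every event that has occurred so far into one update.
        send_at = max(next_ok, event_times[i])
        batch = [t for t in event_times[i:] if t <= send_at]
        sends.append((send_at, len(batch)))
        i += len(batch)
        hold = min(hold * mult if hold else base, max_hold)
        next_ok = send_at + hold
    return sends

# Two failures close together are reported as one bunched update.
print(report_times([0.0, 0.2, 0.4]))  # [(0.0, 1), (1.0, 2)]
```

Note how the second and third events, which occur while the hold-down timer is running, are carried in a single update, trading reporting delay for fewer processing cycles at the receiver.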
Hiding information has several positive effects:
• It breaks a network into failure domains by limiting the scope of devices that must react to any particular change in topology or reachability.
• It reduces the velocity and scope of control plane state, allowing the network to scale to larger sizes while retaining network stability.
• It is a “hook” through which to implement policy, specifically in relation to network security.
It might seem that hiding more state is always better, based on these advantages. However, as with all things in network engineering, the truth is closer to a tradeoff. If you have not found the tradeoff, you have not looked hard enough. In the case of information hiding, refer back to Chapter 1, “Fundamental Concepts,” specifically the section on complexity, and the example given concerning stretch and route aggregation. A second instance of hiding state can be found in relation to microloops, which are explained in Chapter 13, “Unicast Loop-Free Paths (2).” The more you slow down the velocity of state, the longer such microloops will exist in the network.
Hiding state is, then, a useful tool in the hands of good designers, but it can also cause many problems by negatively impacting network performance.
Bollapragada, Vijay, Russ White, and Curtis Murphy. Inside Cisco IOS Software Architecture. Indianapolis, IN: Cisco Press, 2000.
Doyle, Jeff, and Jennifer DeHaven Carroll. Routing TCP/IP, Volume 1. 2nd edition. New Delhi, India: Cisco Press, 2005.
Stringfield, Nakia, Russ White, and Stacia McKee. Cisco Express Forwarding. 1st edition. Indianapolis, IN: Cisco Press, 2007.
Teixeira, Renata, Aman Shaikh, Timothy G. Griffin, and Jennifer Rexford. “Impact of Hot-Potato Routing Changes in IP Networks.” IEEE/ACM Transactions on Networking, 16, no. 6 (December 2008): 1295–307. doi:10.1109/TNET.2008.919333.
White, Russ, and Denise Donohue. The Art of Network Architecture: Business-Driven Design. 1st edition. Indianapolis, IN: Cisco Press, 2014.
White, Russ, and Alvaro Retana. IS-IS: Deployment in IP Networks. 1st edition. Boston, MA: Addison-Wesley, 2003.
White, Russ, Alvaro Retana, and Don Slice. Optimal Routing Design. 1st edition. Indianapolis, IN: Cisco Press, 2005.
White, Russ, and Jeff Tantsura. Navigating Network Complexity: Next-Generation Routing with SDN, Service Virtualization, and Service Chaining. Indianapolis, IN: Addison-Wesley Professional, 2015.
White, Russell I., Steven Edward Moore, James L. Ng, and Alvaro Enrique Retana. United States Patent: 8121130—Determining an optimal route advertisement in a reactive routing environment. 8121130, issued February 21, 2012. http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=8,121,130.PN.&OS=PN/8,121,130&RS=PN/8,121,130.
———. United States Patent: 9191227—Determining a route advertisement in a reactive routing environment. 9191227, issued November 17, 2015. http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9191227.PN.&OS=PN/9191227&RS=PN/9191227.
1. Describe how you can determine the scope and speed of control plane state.
2. Is it possible to cause a network failure through a negative feedback loop? If so, how?
3. Describe the difference between summarizing and aggregating control plane information to control state.
4. Consider the State/Optimization/Surface (SOS) model. How would you describe summarization and aggregation within this model?
5. Consider the State/Optimization/Surface (SOS) model. How would you describe hiding information through layered control planes within this model?
6. What might be some limiting factors in the formation of a positive feedback loop in a network control plane?
7. Research classful Internet Protocol addressing. How might the supernet/subnet concepts fit more “neatly” into this kind of scheme than they do with classless addressing schemes?
8. Research the two patents listed in the “Further Reading” section around reactive control planes and the caching of reachability information. Describe the solution presented in these patents to the problem described in the text.
9. Describe the importance of breaking up a network into failure domains.
10. Describe the relationship between information hiding and failure domains.
The preceding chapter considered the problems that information hiding is designed to prevent or resolve, including positive feedback loops, as part of an overall pattern of security, and to reduce the amount and velocity of state in a control plane. This chapter will provide several examples of information hiding deployed in networks and protocols. The first section here will describe summarization in Intermediate System to Intermediate System (IS-IS) and Open Shortest Path First (OSPF); as described in earlier chapters, summarization here means the removal of topology information, rather than reachability information. The second section will describe aggregation in the Border Gateway Protocol (BGP), which originally included the interesting feature of preserving topology information within aggregated routes. An example of layering will be considered next, specifically external routes carried in BGP layered over internal routes carried in IS-IS. The final section will describe exponential backoff in some detail, as applied to BGP neighbor dampening and the distribution of control plane state and calculation of loop-free paths in IS-IS.
When reading this section, remember that summarization is removing topology information to manage control plane state, and aggregation is removing reachability information to manage control plane state. This section will consider summarization in the context of link state protocols, as distance vector protocols remove topology information at every hop in the network—even the Enhanced Interior Gateway Routing Protocol (EIGRP), which keeps a small radius of topology information.
IS-IS is a link state protocol used in many large-scale networks, including transit providers and data center (cloud) fabrics. The amount and velocity of state carried in a link state control plane can overwhelm slower processors with smaller amounts of memory, such as might be used in lower-cost routers and switches, or devices that must fit into limited physical spaces. Consider the network illustrated in Figure 20-1.
In the early years of networking, routers were often shipped with much less computing power and storage than are available at the time of this writing. Even the routers deployed today for small office and home use can have more available resources than midrange routers used in early networks, and “low-end” routers deployed today often have more available resources than the most capable routers just a few years ago. It is always important to consider this when looking at protocol design and deployment, particularly in the area of summarization and aggregation.
In Figure 20-1, if the status of 100::/64 changes, D and E will receive three or four different copies of the Link State Packet (LSP) generated by A (depending on which LSP arrives at which intermediate system, or IS, and when). Further, if the [A,B] link fails, F will receive an update about this topology change, even though it has no impact on E’s ability to reach 100::/64. There are a number of ways to reduce the control plane state in this network; this section considers one method included in every link state protocol: flooding domains.
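The duplicate copies are a direct consequence of flooding. The sketch below simulates flooding one LSP over a hypothetical densely meshed topology; the adjacency list is invented for illustration and is not the exact topology of Figure 20-1:

```python
from collections import deque

def flood_copies(adjacency, origin):
    """Flood one LSP from `origin`; each IS re-floods the first copy it
    receives to all neighbors except the sender, and silently drops
    duplicates. Returns how many copies each IS received."""
    received = {}
    flooded = {origin}
    queue = deque((origin, nbr) for nbr in adjacency[origin])
    while queue:
        sender, node = queue.popleft()
        received[node] = received.get(node, 0) + 1
        if node not in flooded:
            flooded.add(node)
            queue.extend((node, nbr) for nbr in adjacency[node] if nbr != sender)
    return received

# A densely meshed example topology (hypothetical)
adj = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A", "D", "E"],
    "D": ["B", "C", "E"],
    "E": ["B", "C", "D"],
}
print(flood_copies(adj, "A"))  # D and E each receive three copies
```

Every redundant path delivers another copy of the same LSP; the receiving IS must still process each copy far enough to recognize it as a duplicate.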
A flooding domain is a set of routers (intermediate systems in the case of IS-IS) with completely synchronized databases. When the flooding domain boundary is crossed, topology information will be summarized, and reachability information may be aggregated.
To understand flooding domains, it is best to start with a quick review of Open Systems Interconnect (OSI) addressing, which is used in IS-IS. Figure 20-2 will help illustrate this addressing scheme.
There are two primary sections of the OSI address to consider in Figure 20-2. The right three parts of the address between the dots are unique to each IS, and are generally calculated based on a local physical (Media Access Control, or MAC) address (as these are almost always designed to be unique for each physical interface or piece of hardware). The left sections, which are variable in length (although almost always used as shown in Internet Protocol, or IP, networks), are considered the “area ID” by the IS. Any two intermediate systems with the same information to the left of the right three dotted sections of their OSI address (called the area identifier, or area ID) are considered part of the same level 1 flooding domain, and will form a level 1 adjacency. Likewise, any two intermediate systems with different information in the left sections of their OSI addresses will form a level 2 adjacency, and hence will be considered part of the level 2 flooding domain. There may be many different level 1 flooding domains in an IS-IS network, but there can be only one level 2 flooding domain.
IS-IS forms level 2 adjacencies between each pair of intermediate systems, regardless of the area identifier. To simplify the explanation of flooding domains, however, this text will assume that neighbors capable of forming level 1 adjacencies will form only level 1 adjacencies. This will be revisited later in this chapter to provide more detail on this point.
Figure 20-3 illustrates.
In Figure 20-3, A has a different area ID than B and C; A has 49.0011.2222, while B and C have 49.0011.5555. Hence, [A,B] and [A,C] will be level 2 adjacencies. [B,C], [B,D], [C,E], [B,E], [C,D], [D,F], and [E,F] will all be level 1 adjacencies. Each adjacency type will exchange only the database associated with the correct adjacency level; Figure 20-4 illustrates.
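The adjacency decision can be expressed directly from the address format. This is a sketch of the simplified rule above; the system-ID portions of the example NETs are invented for illustration:

```python
def area_id(net):
    # Everything left of the rightmost three dotted sections is the area ID.
    return net.split(".")[:-3]

def adjacency_level(net_a, net_b):
    """Level 1 if the area IDs match, level 2 otherwise (the simplified
    rule used in this section of the text)."""
    return 1 if area_id(net_a) == area_id(net_b) else 2

# A versus B from Figure 20-3 (system IDs invented for illustration)
print(adjacency_level("49.0011.2222.0000.0c00.0001",
                      "49.0011.5555.0000.0c00.0002"))  # 2
# B versus C, which share area 49.0011.5555
print(adjacency_level("49.0011.5555.0000.0c00.0002",
                      "49.0011.5555.0000.0c00.0003"))  # 1
```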
In Figure 20-4, A, B, and C all have a synchronized (shared) level 2 database; B, C, D, E, and F all have a synchronized (shared) level 1 database.
Note
To be precise, every IS builds and maintains both a level 1 and a level 2 database. If you were to examine F, for instance, you would find it has a level 1 database containing information about every link, node (IS), and reachable destination in the 49.0011.5555 flooding domain. The level 2 database at F, however, would contain a single entry, for F itself. Why does the level 2 database at F have a single entry? F only builds level 1 adjacencies with D and E; it will not synchronize anything in its level 2 database across a level 1 adjacency. Hence, it builds a level 2 database but will not share (synchronize) the contents of this database with any adjacent neighbors.
This arrangement certainly seems to reduce the scope of the reachability and topology information in the network; as the information about 100::/64 is in the level 2 flooding database at A, it will be shared just across level 2 adjacencies; hence only B and C will receive this information. But how can a host connected to F (at 101::/64, for instance) reach this destination? F does not receive a copy of the level 2 database, and therefore cannot know about 100::/64.
IS-IS solves this through the attached bit. B and C, because they are attached to the level 2 flooding domain (remember there can be only one level 2 flooding domain in an IS-IS network), will set the attached bit in their advertisements. This causes D, E, and F to create a default route in their local routing tables to point toward B and C. Traffic originating someplace on 101::/64, then, will be switched based on this default route at F toward either D or E, and then toward B or C from D or E. When the traffic reaches B or C, it will follow the specific route installed in the local routing table based on the information contained in the level 2 database toward the destination.
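The effect of the attached bit at F can be sketched as a longest-prefix-match lookup over a table holding only level 1 routes plus the ::/0 default; the table contents below are hypothetical:

```python
import ipaddress

def lookup(table, dest):
    """Longest-prefix match: the ::/0 installed from the attached bit
    catches any destination with no more specific level 1 route."""
    addr = ipaddress.ip_address(dest)
    matches = [(ipaddress.ip_network(p), nh) for p, nh in table
               if addr in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# F's table: a level 1 route plus the default toward the attached ISes
f_table = [("::/0", "D"),          # from the attached bit (via D, one of D or E)
           ("101::/64", "local")]  # F's own connected network
print(lookup(f_table, "101::1"))   # local
print(lookup(f_table, "100::10"))  # D  (follows the default toward level 2)
```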
What about the return traffic? If 101::/64 is shared to B, C, D, E, and F through a level 1 flooding database, A cannot know about this destination. Hence, hosts attached to the 100::/64 network will not be able to send traffic toward a host on the 101::/64 network. IS-IS solves this problem through redistribution between the two databases. Any destinations in the level 1 flooding database are automatically redistributed into the level 2 flooding database as if they are attached to the redistributing IS. The cost to reach the destination from the redistribution point is preserved in the new route injected into level 2 to provide (some level of) optimal routing through the network.
What if you want to carry a more specific route toward 100::/64 into the level 1 flooding domain in the network shown in Figure 20-4? Most IS-IS implementations allow this through redistribution from the level 2 database into the level 1 database through route leaking. To prevent routing loops (consider what would happen if 100::/64 were redistributed from the level 2 database into the level 1 database and back again), routes redistributed from the level 2 flooding domain into a level 1 flooding domain have the down bit set; this means the route has been redistributed down the flooding domain hierarchy and should not be redistributed back up the hierarchical level structure.
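The down bit rule can be sketched as a pair of redistribution filters. The route records here are hypothetical dictionaries; a real implementation carries the bit in the prefix advertisement itself:

```python
def leak_down(l2_routes):
    # Routes leaked from level 2 into level 1 carry the down bit.
    return [dict(route, down=True) for route in l2_routes]

def redistribute_up(l1_routes):
    # Only routes without the down bit may go back into level 2,
    # which breaks the potential redistribution loop.
    return [r for r in l1_routes if not r.get("down")]

leaked = leak_down([{"prefix": "100::/64"}])
print(redistribute_up(leaked))  # []  (the leaked route never returns to level 2)
```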
IS-IS, then, both aggregates routing information and summarizes topology information at a flooding domain boundary. It is possible to leak more specific reachability information through the aggregate (the attached bit, which causes a ::/0 route to be installed in the local routing table of each IS in the level 1 flooding domain), so route aggregation can be “undone,” if the network designer decides it is important to do so.
An interesting point to note about IS-IS flooding domains is this: there are no “hard boundaries” of any kind between the level 1 and level 2 flooding domains. It is perfectly valid for every IS in a network to be a part of the level 2 flooding domain, as well as some level 1 flooding domain. Figure 20-5 illustrates a network in which some intermediate systems are in both level 1 and level 2 flooding domains.
Four flooding domains are illustrated in Figure 20-5. The first, 49.0011.1111, contains A, B, and C. The second, 49.0011.3333, contains D and F. The third, 49.0011.4444, contains G, H, M, and N. The fourth flooding domain is the overlaying level 2 flooding domain, which contains C, E, D, G, and F. The first interesting point here is that all the intermediate systems in the level 2 flooding domain are also in a level 1 flooding domain, with the exception of E. Each of the intermediate systems in both a level 1 and a level 2 flooding domain has formed a level 2 adjacency with each of its connected neighbors in the level 2 flooding domain, and is synchronizing its level 2 database with its level 2 neighbors. For instance, C has four adjacencies:
• A, with which it is synchronizing only the level 1 database
• B, with which it is synchronizing only the level 1 database
• D, with which it is synchronizing only the level 2 database
• E, with which it is synchronizing only the level 2 database
The second interesting point is that D and F are in a level 1 flooding domain, 49.0011.3333, completely overlapped by the level 2 flooding domain. D and F have formed both a level 1 and level 2 adjacency across the link between them, and are synchronizing both the level 2 database and the 49.0011.3333 database. It is possible for two adjacent intermediate systems with the same area ID to form a level 2 adjacency; it is not possible for two adjacent intermediate systems with different area IDs to form a level 1 adjacency.
OSPF is also a link state protocol, and hence also subject to the same sorts of limitations as IS-IS; a rapidly changing topology can sometimes overwhelm slower processors with smaller amounts of memory. To prevent this, network designers can break up the flooding domains in an OSPF network into areas. OSPF areas are implemented differently from the flooding domains in IS-IS; Figure 20-6 illustrates.
While some of the mechanisms are similar, there are some important differences between the two.
First, OSPF areas cannot overlap; Area Border Routers (ABRs) connect two flooding domains (or areas) together and have two databases (one per area). Every other router in the network has one link state database (LSDB), which contains reachability and topology information. The outlying area IDs can identify which area a particular router is in; area 0 acts as a centralized area connecting all of the outlying areas together.
Second, OSPF summarizes by default, but it does not aggregate by default. OSPF carries information in a series of Link State Advertisement (LSA) types, each type carrying a different kind of information. The most common types are
•路由器:有关始发路由器、连接的邻居和连接的目的地的信息
• Router: Information about the originating router, connected neighbors, and connected destinations
•网络LSA:伪节点
• Network LSA: A pseudonode
•区域间前缀(或 摘要)LSA:摘要可达性信息
• Inter-Area Prefix (or summary) LSA: Summarized reachability information
•区域间路由器LSA:有关始发路由器的信息
• Inter-Area Router LSA: Information about the originating router
• AS-External LSA:外部可达性信息
• AS-External LSA: External reachability information
• AS-外部非末节区域 (NSSA) LSA:外部可达性信息
• AS-External Not-So-Stubby Area (NSSA) LSA: External reachability information
Many of these LSA types are used to carry information between flooding domains; Figure 20-7 illustrates.
The set of LSAs carried between two areas depends on the type of the outlying (nonarea 0) area. Several of these are described in the following sections.
An Area Border Router (ABR) sitting between a normal area and area 0 will summarize topology information by default, but not aggregate routing information. It is best to begin with the information B knows about area 1, and then examine what information B would send toward A, in area 0. In this network, B would have in its area 1 LSDB:
• An AS-external LSA for 100::/64 originated by D
• A network LSA (pseudonode) for the [C,D] broadcast link, which would include a connection to C, D, and 101::/64
• A router LSA from D with a connection to the [C,D] network LSA (pseudonode), a connection to C, and a connection to 101::/64
• A router LSA from C with a connection to the [C,D] network LSA (pseudonode), a connection to D, a connection to 101::/64, a connection to B, and a connection to 102::/64
• A router LSA from B with a connection to C and a connection to 102::/64
If no aggregation is manually configured, A would receive the following information about area 1:
• An AS-external LSA for 100::/64 originated by D. External LSAs are not modified in any way by the ABR in a normal OSPF area.
• A summary LSA including 101::/64 and 102::/64 originated at B. From the perspective of the Shortest Path Tree, these two routes will appear to be connected to B itself, with B’s cost to reach each destination preserved in the summary LSA.
Route aggregation can be manually configured in OSPF implementations; for instance, the 100::/64, 101::/64, and 102::/64 reachable destinations in OSPF area 1 in Figure 20-7 could be aggregated into a 100::/60. In this case, B would advertise a single route in its summary LSA toward A, 100::/60.
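Computing such an aggregate can be sketched with Python's ipaddress module. The prefixes below are documentation-range stand-ins rather than the chapter's 100::/64 through 102::/64, since the exact aggregate prefix depends on the bit patterns involved:

```python
import ipaddress

def aggregate(prefixes):
    """Smallest single prefix covering every prefix in the list."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    agg = nets[0]
    # Widen the prefix one bit at a time until it covers everything.
    while not all(n.subnet_of(agg) for n in nets):
        agg = agg.supernet()
    return agg

# Hypothetical stand-ins for three contiguous /64s behind one ABR
print(aggregate(["2001:db8:0:0::/64",
                 "2001:db8:0:1::/64",
                 "2001:db8:0:2::/64"]))  # 2001:db8::/62
```

The ABR would then advertise the single covering prefix into area 0 in place of the three component routes.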
OSPF stub areas are designed to support outlying areas with no external reachability; all topology and reachability information is internal to OSPF. Because of this, routers in OSPF stub areas are not allowed to carry redistributed routing information into the outlying area; likewise, external routing information from the rest of the network is not carried into the stub area. As a result, D would not be able to redistribute the 100::/64 route into area 1. In this network, B would have in its area 1 LSDB:
• A network LSA (pseudonode) for the [C,D] broadcast link, which would include a connection to C, D, and 101::/64
• A router LSA from D with a connection to the [C,D] network LSA (pseudonode), a connection to C, and a connection to 101::/64
• A router LSA from C with a connection to the [C,D] network LSA (pseudonode), a connection to D, a connection to 101::/64, a connection to B, and a connection to 102::/64
• A router LSA from B with a connection to C and a connection to 102::/64
If no aggregation is manually configured, A would receive the following information about area 1:
A summary LSA including 101::/64 and 102::/64 originated at B. From the perspective of the Shortest Path Tree, these two routes will appear to be connected to B itself, with B’s cost to reach each destination preserved in the summary LSA.
The ABR, B in Figure 20-7, will also transmit a default route (::/0) into the outlying flooding domain (area 1, in this case), so C and D can reach any external destinations connected to other areas in the network. B, C, and D would still know about the [A,B] link, as this is internal routing information.
If an area is configured as a totally stubby area, routers within the area cannot originate (redistribute) external routing information into OSPF, and external routes are not carried into the area. In this case, the LSDB at B would be the same as in the stub area case, and the LSAs advertised by B toward A would also be the same as the stub area case. The primary difference between a stub area and a totally stubby area is the handling of reachability information into area 1 from area 0. In a totally stubby area, the ABR (B, in the network in Figure 20-7) will generate a summary LSA into area 1 with just a default route (::/0). C and D would have no knowledge of the topology or the reachable destinations beyond area 0, such as the existence of A or the [A,B] link.
Routers in an NSSA can redistribute routing information into the network from other sources (such as another routing protocol, or statically configured routes), but external routes from other OSPF routers (in area 0) are blocked at the ABR. Because AS-external LSAs are not allowed within the flooding domain, a special kind of LSA is used instead: the AS-external NSSA LSA (a type 7 LSA in OSPFv2). In Figure 20-7, B would have the following LSDB entries if area 1 is configured as an NSSA:
• A network LSA (pseudonode) for the [C,D] broadcast link, which would include a connection to C, D, and 101::/64
• A router LSA from D with a connection to the [C,D] network LSA (pseudonode), a connection to C, and a connection to 101::/64
• A router LSA from C with a connection to the [C,D] network LSA (pseudonode), a connection to D, a connection to 101::/64, a connection to B, and a connection to 102::/64
• A router LSA from B with a connection to C and a connection to 102::/64
• An AS-external NSSA LSA from D carrying 100::/64
The special NSSA AS-external cannot be leaked outside the area, so the ABR must translate it into a standard AS-external LSA before sending it into area 0. Given this translation, if no aggregation is manually configured, A would receive the following information about area 1:
• An AS-external LSA for 100::/64 originated by D. External LSAs are not modified in any way by the ABR in a normal OSPF area.
• A summary LSA including 101::/64 and 102::/64 originated at B. From the perspective of the Shortest Path Tree, these two routes will appear to be connected to B itself, with B’s cost to reach each destination preserved in the summary LSA.
The totally not-so-stubby area (totally NSSA) is
• Similar to the totally stubby area because the ABR just sends a single summary LSA containing a default route (::/0) into the outlying area
• Similar to the not-so-stubby area (NSSA) because routers within the area can originate external routes using the AS-external NSSA LSA, which the ABR translates into an AS-external LSA, which is then transmitted into area 0
It is possible, in the right situation, for the summarization of topology information to cause a router in area 0 to choose a less than optimal path to an external destination. Figure 20-8 illustrates this.
In Figure 20-8, if A just has a single AS-external route toward 100::/64, it will choose the closest ABR connected to area 1 (the outlying area that the route is being redistributed into). In this case, A would choose C, sending all traffic toward the 100::/64 destination along a path with a total cost of 30. There is a path with a total cost of 25 available, but A does not “know” about this path, as the internal topology of area 1 is hidden through the OSPF summarization process at the ABRs.
To resolve this problem, OSPF ABRs will generate an inter-area router LSA for each ASBR (or each router which is redistributing reachable destinations into OSPF). The inter-area router LSA contains the ABR’s cost to reach a particular ASBR. In this network, then, assuming area 1 is some sort of area that supports redistribution, A will have at least the following entries in its LSDB:
• An AS-external LSA for 2001:db8:3e8:100::/64 originated by D (this could potentially be a translated AS-external NSSA LSA, but A will not know the difference between these two possibilities)
• An inter-area router LSA generated by B with a cost of 10 to reach D
• An inter-area router LSA generated by C with a cost of 20 to reach D
Using this information, A can compare
• The cost through B toward 100::/64, by adding the cost to B to the cost from B to D, for a total cost of 25
• The cost through C toward 100::/64, by adding the cost to C to the cost from C to D, for a total cost of 30
The additional LSA provides the information A needs to choose the optimal path to the 100::/64 external destination, through B.
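The comparison A performs can be sketched in a few lines of Python. The cost-to-ABR values here are assumptions chosen only to reproduce the totals discussed above; the ABR-to-ASBR costs come from the inter-area router LSAs.

```python
# Cost from A to each ABR, learned from A's intra-area SPF
# (assumed values, consistent with the totals in the text).
cost_to_abr = {"B": 15, "C": 10}

# Each ABR's cost to reach the ASBR (D), carried in the
# inter-area router LSAs.
abr_to_asbr = {"B": 10, "C": 20}

# Total cost to the external destination through each ABR.
total = {abr: cost_to_abr[abr] + abr_to_asbr[abr] for abr in cost_to_abr}

best_abr = min(total, key=total.get)
print(best_abr, total[best_abr])  # B 25
```

Without the inter-area router LSAs, A would only see its own cost to each ABR and would pick C, the closer ABR, for a total cost of 30.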
Note
This discussion around optimal routing and inter-area router LSAs should bring to mind the discussion in Chapter 1 around state, optimization, and surface. This is a specific instance where removing state from the control plane can result in suboptimal traffic flows, and where adding information back in is used as a technique to make traffic flows more optimal again.
If you find the various area types confusing, you are not alone; network engineers struggle with remembering which area type permits what kind of information. If you can remember three simple rules, however, you can easily figure out what sort of information should be where in any OSPF implementation:
• Stub means no externals are allowed in the area at all.
• Not-so means externals are allowed out of the area.
• Totally means no internals are allowed into the area.
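The three rules compose mechanically from the area type's name. A purely illustrative sketch (the function and its return keys are inventions for this example, not part of any OSPF implementation):

```python
def allowed_information(area_type: str) -> dict:
    """Apply the three naming rules to an OSPF area type name."""
    t = area_type.lower()
    return {
        # "stub" in the name: no externals flooded into the area.
        "externals_into_area": "stub" not in t,
        # "not-so" relaxes this: externals may originate inside the
        # area and be translated out at the ABR.
        "externals_out_of_area": "not-so" in t or "stub" not in t,
        # "totally": no internal (inter-area) routes into the area,
        # just a default route from the ABR.
        "internals_into_area": "totally" not in t,
    }

print(allowed_information("totally not-so-stubby"))
```

Applying the function to "totally not-so-stubby" reports exactly what the rules predict: no externals in, externals allowed out, no internals in.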
Area types are designed to decrease the amount of information carried between flooding domains, reducing the amount of information any particular router in the network needs to store and process. Many OSPF implementations can also filter the information inserted into a summary LSA; this is often called type 3 filtering, even though the summary LSA may not be a type 3 in every version of OSPF.
Aggregation reduces the state in the network by combining multiple reachable destinations into a single destination. Most of the time, aggregation entails summarization, as seen in the example of routing information transiting flooding domains in IS-IS. This is true of OSPF, as well—in almost all cases, aggregating routing information involves summarizing topology information as well.
Is there any case where aggregation is used without summarization? There is an older feature in the Border Gateway Protocol (BGP), now generally deprecated and almost never really deployed, that does aggregate routes without discarding all of the available topology information. Figure 20-9 is used to illustrate.
In Figure 20-9, a series of BGP Autonomous Systems (AS) have been connected in a ring. Assume the following:
• 100::/64 is advertised through the [65004,65001] boundary, toward AS65000 and AS65002.
• 100::/64 is filtered at the [65004,65003] boundary toward AS65003.
• The 100::/64 and 101::/64 routes are aggregated at the [65001,65002] boundary toward AS65002.
If some router in AS65004 prefers the aggregate over the longer prefix route within AS65004 (because of a local filter, for instance), it is possible a routing loop can form in this network. This type of situation can occur because BGP relies on the AS path to prevent loops across an internetwork. How can this problem be resolved? The most obvious solution would be to somehow include enough information about the AS path to prevent the 100::/60 route from being leaked back into any AS where a component of the aggregate is connected.
To prevent such loops from forming, BGP required any speaker aggregating routing information to include an atomic aggregate in the aggregate update. The atomic aggregate included the full list of every AS in the path of any of the component routes making up the aggregate. In this case, the 100::/60 aggregate advertised into AS65002 would have an AS path with one entry, 65001, but it would also contain an atomic aggregate containing [65004, 65000]. When the eBGP speaker between Autonomous Systems 65003 and 65004 receives the 100::/60 aggregate, it can examine the atomic aggregate attribute and determine at least some component of the aggregated route originated in AS65004. Hence, the eBGP speaker at the edge of AS65003 can reject the aggregate route, preventing the loop.
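The loop-prevention check can be sketched as follows. This is a simplified model of the behavior described above, not the actual BGP attribute handling in any implementation:

```python
def should_reject(aggregate_as_list, local_as):
    """Reject an aggregate if any of its component routes traversed
    the local AS.

    aggregate_as_list models the list of Autonomous Systems carried
    alongside the aggregate (the "atomic aggregate" described in the
    text); a BGP speaker applies the same membership test to the
    ordinary AS path for non-aggregated routes.
    """
    return local_as in aggregate_as_list

# The 100::/60 aggregate as described: AS path [65001], plus the
# component AS list [65004, 65000].
print(should_reject([65004, 65000], local_as=65004))  # True: loop prevented
print(should_reject([65004, 65000], local_as=65003))  # False: AS65003 may accept
```

The eBGP speaker at the AS65003/AS65004 edge finds its own AS number in the component list and drops the aggregate, exactly as it would drop an ordinary route whose AS path contained its own AS.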
Note
There have been many proposals to remove the atomic aggregate from BGP; the most recent is Deprecate Atomic Aggregate.1
While layering is not normally considered a form of information hiding by network engineers, it definitely does hide full information about the topology and reachability from some set of forwarding devices. Two examples—using BGP as an overlay to carry external routing information and Segment Routing (SR) combined with a controller to produce a traffic engineering (TE) overlay—will be used to illustrate layering.
Note
It is impossible for an introductory-level book, such as this one, to give an in-depth overview of the many possible layering protocols and systems invented and deployed in networks. To give a small sample: Layer 3 virtual private networks (L3VPNs) based in MPLS and IP and IP tunnels, Layer 2 virtual private networks (L2VPNs), Ethernet VPNs (eVPNs), traffic-engineered overlays using Path Computation Element Protocol (PCEP), VXLAN (which has a native control plane, although the tunneling encapsulation is often used with a different control plane), 802.1q virtual local area networks (VLANs), Transparent Connection of Lots of Links (TRILL) VLANs, and SR. Covering these topics would require another entire book, and this one is quite large enough. Readers who would like to read more on these topics should look at the “Further Reading” section at the end of this chapter for more information.
The Border Gateway Protocol (BGP) was originally designed to carry inter-Autonomous System (inter-AS) information; it was explicitly not designed to carry reachability information within an AS. The basic design was to separate internal reachability (within the AS) from external reachability (outside the AS, or in the default free zone, or DFZ), in order to
• Prevent changes outside the network from impacting the operation of the network itself
• Allow different policies to be applied to internal and external routes
The first reason, to prevent changes external to the network from impacting the operation of the network itself, should be a familiar reason to hide information—to break up a network into multiple failure domains. Figure 20-10 illustrates.
In Figure 20-10:
• IS-IS is running on B, C, D, and E to provide reachability and topology information within the AS.
• E and F are configured with an external BGP (eBGP) session.
• A and B are configured with an eBGP session.
• Each pair of [B,C], [B,D], [D,E], and [C,E] is configured with an interior (iBGP) session.
• C and D are acting as route reflectors for B and E.
Tracing the path of the route advertisement to 100::/64 through the network:
• F advertises 100::/64 to E over the eBGP session; the AS path is set to [65002].
• E advertises the 100::/64 route to D and C, which then reflect the route to B.
• B advertises the route to A over the eBGP session; the AS path is set to [65002,65001].
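The advertisement trace above can be modeled in a few lines. This is a simplified sketch; the AS list is written in the order the text uses rather than BGP's on-the-wire prepending order:

```python
def advertise(as_path, session, local_as):
    """Return the AS path as the neighbor sees it.

    The local AS is added only when the route crosses an eBGP
    session; iBGP (including route reflection) leaves the AS path
    unchanged.
    """
    return as_path + [local_as] if session == "ebgp" else list(as_path)

path = advertise([], "ebgp", 65002)    # F -> E: [65002]
path = advertise(path, "ibgp", 65001)  # E -> C/D -> B (reflected): unchanged
path = advertise(path, "ebgp", 65001)  # B -> A: [65002, 65001]
print(path)
```

The route crosses the entire AS65001 fabric without the AS path growing; only the two eBGP edges leave a mark, which is exactly the information hiding the next paragraphs describe.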
The Interior Gateway Protocol (IGP), which runs within the AS, does not need to carry the 100::/64 route at all. IS-IS carries just internal destinations, such as 100f::/64. Another way to put this is, IS-IS provides the internal reachability to allow BGP to form sessions through the AS, while BGP carries the reachability information allowing other Autonomous Systems to transit the local AS (to forward traffic from A, through AS65001, and on to F).
This separation of duties is a form of layering; BGP overlays IS-IS (the IGP), using the IGP to form adjacencies and discover paths within the AS along which it can forward traffic. IS-IS, on the other hand, does not need to know about any externally reachable destinations. How does this divide the network into two failure domains?
First, IS-IS (or any other IGP) is not impacted by changes to topology and reachability information external to the AS. If the link between F and 100::/64 changes, the IS-IS processes running on B, C, D, and E do not need to recalculate anything, as nothing in the network has changed from their perspective.
Second, peering Autonomous Systems are shielded from changes within AS65001. For instance, if the [C,E] link fails, the path has not changed from the perspective of A; the AS path remains the same, so BGP does not need to reconverge.
The fate of internal and external topology and reachability information is (at least to a large degree) separated from one another; hence the internal routing and the external routing are two different failure domains.
Note
There has been some research into how often, and under what circumstances, AS level policies, combined with route changes within an AS, will “leak” out to the rest of the DFZ.2 These kinds of information leaks cross what should be a failure domain boundary, merging the failure domains at least in some small part. This is an example of leaky abstractions, which are discussed in Chapter 1.
SR is perhaps the simplest possible use of Multiprotocol Label Switching (MPLS) short of manually configured point-to-point tunnels through a network. The general idea behind SR is to stack a set of labels at one edge of the network, so each device along the path can forward based on the outermost label exposed on the stack. As each device pops the outermost label off the stack, a new label is exposed describing the next hop in the path. Figure 20-11 is used for illustration.
Note
SR is described here at a very high level. There are many more details in the design, deployment, and operation of SR; please refer to the “Further Reading” section at the end of the chapter for good references on SR.
In Figure 20-11, A receives a packet destined to 100::/64. Based on IP routing, this packet will be forwarded across the lowest-cost path, along [A,B,D,E,F]. What if the network operator wants the packet to travel along the alternate path, [A,B,C,E,F]? It is possible to modify the metrics along the paths, of course, but this would impact all the traffic entering the network at A and destined to 100::/64.
Several overlay technologies can solve this type of problem using a number of different control planes and a number of different encapsulations. If you multiply every possible encapsulation with every possible control plane, you will probably find there are more ways to solve this problem than can be explained in normal terms. Perhaps the real explanation is bored engineers who enjoy the challenge of solving the same problem in as many ways as possible. SR is a relatively simple way to solve this problem. If the operator has SR deployed on his network, he can
• Compute the path from A to F through [B,C,E].
• Discover the MPLS label for each device along the way; the resulting label stack would be [30,31,33,34].
• Impose this label stack on the packet while it is being switched at A.
Once this label stack has been imposed at A, the switching path would be
• A would forward the packet to label 30, which is B.
• When B receives this packet, it pops the outermost label on the stack; the stack is now [31,33,34].
• B will switch the packet toward 31, the outermost label on the stack, which is C.
• When C receives this packet, it pops the outermost label on the stack; the stack is now [33,34].
• C will switch the packet toward 33, which is E.
• When E receives this packet, it pops the outermost label on the stack; the stack is now [34].
• E will switch the packet toward 34, which is F.
Finally, F will pop the final label off the stack and forward the traffic based on the destination IP address. Where does the label that stack A imposes come from? There are, as always, a large number of possibilities, but in order to keep the example as simple as possible, assume there is a controller located someplace on the network, labeled as G in Figure 20-11.
This controller can participate in the routing protocol to discover the topology of the network and the reachable destinations (including what MPLS label has been assigned to each device). After combining this information with a set of policies, it can calculate the correct traffic-engineered path through the network and then signal A about what label stack to impose on this particular flow.
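Assuming the label bindings given above ([30,31,33,34] mapping to B, C, E, and F), the pop-and-forward walkthrough can be simulated in a few lines. This is a sketch of the forwarding behavior only, not of any real MPLS data plane:

```python
# Label-to-router bindings from the Figure 20-11 walkthrough.
next_hop = {30: "B", 31: "C", 33: "E", 34: "F"}

def forward(stack, at):
    """Walk a packet along an SR path by popping the outermost label
    at each hop and switching toward the router it names."""
    hops = [at]
    while stack:
        label = stack.pop(0)  # pop the outermost label...
        at = next_hop[label]  # ...and switch toward that router
        hops.append(at)
    return hops               # F then forwards on the IP header

print(forward([30, 31, 33, 34], at="A"))  # ['A', 'B', 'C', 'E', 'F']
```

The stack imposed at A fully determines the path [A,B,C,E,F]; no router along the way needs any per-flow state, which is the point of the layering.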
This kind of layering reduces state (hides information) by
• Allowing the traffic engineering policy to be pulled out of the distributed control plane, reducing the state in the distributed control plane considerably
• Removing the process of neighbor discovery and other distributed elements from the purview of the distributed control plane
Even if the policy controller fails, the network will still forward traffic, which means the controller has been placed into a different failure domain.
A third technique often used in protocols to hide information is to simply slow down the rate at which information is distributed through the network. Slowing down state velocity does not technically hide information in the permanent sense: it either allows network devices to “bunch up information” requiring shorter bursts of processing spaced farther apart, it allows network devices to take on information at a steady pace, or it removes duplicate copies of control plane state from the network. There are many different ways to reduce the velocity of state in a network control plane, but two examples are considered here: exponential backoff and flooding reduction.
Exponential backoff is used in a wide variety of contexts, including
• In slowing down (dampening) the speed at which routes are propagated throughout an internetwork
• In slowing down (dampening) the speed at which interface state change is allowed to propagate in some network operating system implementations
• In slowing down the speed at which a link state protocol will compute a new Shortest Path Tree (SPT) on receiving new topology information
• In slowing down the speed at which routing information is distributed by a link state protocol in response to a change in the state of a link
This section will use running SPF in a link state protocol as an example, but you should keep in mind there are many places where exponential backoff can be used. To understand exponential backoff, several definitions will be needed:
• Initial wait: The amount of time the implementation will wait after receiving an event before processing it
• Second wait: Multiplied by an exponential offset to set the wait time on subsequent events
• Max wait: Used for two purposes:
• The amount of time the implementation will wait after an event before setting the wait timer back to initial wait
• The maximum amount of time the implementation will ever set the wait timer
• Wait timer: The amount of time the implementation will wait before processing the items currently in the processing queue
The exponential backoff process looks something like this in pseudocode:
// initial_wait == initial wait time
// second_wait == second wait time
// max_wait == max wait time
// next_wait == how long before the wait_timer expires
// begin by expiring the reset_timer
when reset_timer expires {
stop wait_timer
next_wait = initial_wait
backoff = 1
}
when event_occurs {
if wait_timer is running {
next_wait = backoff * second_wait
if next_wait > max_wait {
next_wait = max_wait
}
backoff = backoff * 2
} else {
next_wait = initial_wait
}
set event to process at wait_time
start reset_timer to expire in max_wait * 2
start wait_timer to expire in next_wait
}
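A runnable Python translation of the pseudocode above may make the behavior easier to experiment with. This sketch simplifies the timers: instead of running asynchronously, they are modeled as expiry timestamps compared against a clock value passed in by the caller.

```python
class ExponentialBackoff:
    """Runnable sketch of the backoff pseudocode above."""

    def __init__(self, initial_wait, second_wait, max_wait):
        self.initial_wait = initial_wait
        self.second_wait = second_wait
        self.max_wait = max_wait
        self.backoff = 1
        self.wait_until = None   # wait_timer expiry (None == not running)
        self.reset_until = None  # reset_timer expiry

    def event(self, now):
        """Mirror of `when event_occurs`; returns the chosen next_wait."""
        if self.reset_until is not None and now >= self.reset_until:
            # `when reset_timer expires`: back to the initial state.
            self.wait_until = None
            self.backoff = 1
        if self.wait_until is not None and now < self.wait_until:
            # wait_timer is running: back off exponentially, capped.
            next_wait = min(self.backoff * self.second_wait, self.max_wait)
            self.backoff *= 2
        else:
            next_wait = self.initial_wait
        self.reset_until = now + self.max_wait * 2
        self.wait_until = now + next_wait
        return next_wait

b = ExponentialBackoff(initial_wait=1, second_wait=2, max_wait=10)
print(b.event(0))    # 1  (initial_wait)
print(b.event(0.5))  # 2  (second_wait * 1)
print(b.event(1))    # 4  (second_wait * 2)
print(b.event(2))    # 8
print(b.event(3))    # 10 (capped at max_wait)
```

Feeding in a burst of closely spaced events shows the wait doubling until it hits the max_wait cap, exactly the curve Figure 20-12 describes.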
Figure 20-12 is used to explain exponential backoff in determining when to run SPF.
Assume some router is configured to use exponential backoff to reduce the amount of processing required for running SPF. Using Figure 20-12, the sequence of events might look like this:
1. The router begins with the wait_time set to initial_wait.
2. The router receives a new link state entry (whether an LSA or an LSP doesn’t matter for this example; it could be OSPF or IS-IS).
3. The event is accepted, and
a. A timer is set to max_wait * 2; this can be called the reset_timer.
b. A timer is set for this specific event; once this timer expires, the event will be processed.
4. Before reset_timer expires, a second event is received.
a. The reset_timer is restarted, so it will again expire in max_wait * 2.
b. A timer is set for this specific event; once this timer expires, the event will be processed.
c. The wait_time is set to second_wait * 1, as this is the second event.
5. Before reset_timer expires, a third event is received.
a. The reset_timer is restarted, so it will again expire in max_wait * 2.
b. A timer is set for this specific event; once this timer expires, the event will be processed.
c. The wait_time is set to second_wait * 2, as this is the third event.
The wait_time doubles with each new event (the multiplier can be configured on some implementations). If the wait_time ever reaches max_wait, it will be capped at max_wait. And if the reset timer ever expires, which is max_wait * 2 from the last event, the entire system resets to its initial state. This process produces a wait_time as shown in Figure 20-12; the timer value increases exponentially until it reaches a cap. If no events happen for some long period, the entire system resets.
Why an exponential backoff? Because it allows the system to react quickly at first, but then to slow down its reaction time until the system is at the slowest acceptable speed. In the case of an SPF run, the first SPF run would occur quickly, but as more link state updates are received, the SPF runs are spread farther apart until some maximum is reached. This allows for fast reactions to individual events, while dampening the rate at which a large number of quickly occurring events is processed.
In deploying link state protocols onto highly meshed mobile networks, the amount of flooding required to converge can be a limiting factor. Figure 20-13 is used to illustrate.
In Figure 20-13:
• The columns and rows are marked instead of each individual router being marked; A1, for instance, is at the top-left corner, while D5 is at the lower-right corner.
• The tier numbers are marked on the right side; routers in T0 are Top of Rack (ToR) switches (or routers).
If some change occurs at A5, then
• A5 will flood a link state change to every router in row 4.
• Every router in row 4 will flood a link state change to A3.
A3, then, will receive four copies of the same link state change; in fact, every router in the fabric will receive at least four copies of the same link state change, and some will receive more copies. How can the number of copies be reduced in this topology? By building a view of the routers two hops away from any router that is flooding a change, and somehow signaling just one of them to reflood the link state change to its neighbors. For instance, if A5 can discover that A4 can reach every router two hops away from A5 itself, A5 can send the link state change to B4-D4 formatted so they will not reflood it, while sending the update to A4 in a way that allows A4 to reflood the change.
But how can A5 determine which routers are two hops away? At least two methods have been devised and implemented:
• Each router can report its full set of neighbors to every neighbor, rather than just the neighbors on this link. For instance, A4 can report the set of neighbors [A3,B3,C3,D3,B5,C5,D5] to A5, rather than just [A5] (as it would normally do to verify two-way connectivity during adjacency formation).
• Once the initial adjacencies are formed, an SPF can be run at A5 that is restricted to two hops.
Once A5 discovers all of its neighbor’s neighbors, it can build a minimum list of neighbors to flood to “cover” the entire set of two-hop neighbors. Given A4, B4, C4, and D4 all have the same set of neighbors, designating any of these neighbors as a “reflooder” will ensure the changes to the LSDB are synchronized through the network. Any neighbors on this list should receive changes in a way that allows them to reflood; neighbors not on this list should receive link state information in a way that does not allow them to reflood the changes. There are a number of ways a link state protocol can be modified to limit flooding scope in this way.
To outline the process in the network shown in Figure 20-13:
1. A5 discovers some change to the topology or reachability information.
2. A5 determines that A4, B4, C4, and D4 all have the same set of two-hop neighbors.
3. A5 selects one neighbor as a reflooder (or designated flooder); assume this is A4.
4. A5 floods to A4 normally.
5. A5 floods to B4, C4, and D4 using a mechanism that does not allow them to reflood the change.
6. A4 determines that A3, B3, C3, and D3 all have the same set of two-hop neighbors.
7. A4 selects one neighbor as a reflooder (or designated flooder); assume this is A3.
8. A4 floods to A3 normally.
9. A4 floods to B3, C3, and D3 using a mechanism that does not allow them to reflood the change.
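The reflooder selection in this process amounts to a set-cover choice over the two-hop neighbor sets. A minimal sketch, using a greedy cover (the neighbor sets are taken from the Figure 20-13 description; real implementations use their own selection algorithms):

```python
def pick_reflooders(neighbors):
    """Greedily choose a small set of neighbors whose own neighbor
    sets cover every two-hop router; only these are asked to reflood."""
    two_hop = set().union(*neighbors.values())
    chosen, covered = [], set()
    while covered != two_hop:
        # Pick the neighbor covering the most still-uncovered routers.
        best = max(neighbors, key=lambda n: len(neighbors[n] - covered))
        chosen.append(best)
        covered |= neighbors[best]
    return chosen

# A5's neighbors and the two-hop routers reachable through each; per
# the text, A4 through D4 all reach the same set of tier-3 routers.
nbrs = {
    "A4": {"A3", "B3", "C3", "D3"},
    "B4": {"A3", "B3", "C3", "D3"},
    "C4": {"A3", "B3", "C3", "D3"},
    "D4": {"A3", "B3", "C3", "D3"},
}
print(pick_reflooders(nbrs))  # a single reflooder covers all of tier 3
```

Because all four neighbors cover the same two-hop set, a single reflooder suffices, and the other three receive the update in a form that suppresses reflooding.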
There is a bit more to this technique than is outlined here; refer to the “Further Reading” section to find out more about this technique, as well as other flooding reduction mechanisms similar to this technique.
The initial problem of summarizing information appears to be fairly simple; working from within the framework of a distance vector protocol in a simple network, it can be. In link state protocols, however, the ability to summarize without aggregation and the requirement for aggregation and summarization to take place at a specific place in the network make summarization and aggregation more difficult. Quite often, ideas and concepts are agglutinated, causing each idea to be difficult to understand on its own. Disentangling the ideas, however, to make them easier to understand can prove difficult as well. Adding in external routing information makes summarization and aggregation more complex in a link state protocol.
All of these concepts are extremely important for you to understand as a network engineer. Combining a base knowledge of how any given method of carrying control plane information works, how each shortest path algorithm behaves on any given topology, and how and where information is aggregated and/or summarized can give you a quick read of how a network will work normally, as well as in various failure situations.
This line of thinking provides a good segue to move from thinking about protocols to thinking about network operation.
Dearlove, Christopher, Thomas H. Clausen, Ulrich Herberg, and Philippe Jacquet. The Optimized Link State Routing Protocol Version 2. Request for Comments 7181. RFC Editor, 2014. doi:10.17487/rfc7181.
Ferguson, Dennis, Acee Lindem, and John Moy. OSPF for IPv6. Request for Comments 5340. RFC Editor, 2008. doi:10.17487/rfc5340.
Hares, Susan. “Deprecate Atomic Aggregate.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-hares-deprecate-atomic-aggregate-00.
“Intermediate System to Intermediate System Intra-Domain Routing Information Exchange Protocol for Use in Conjunction with the Protocol for Providing the Connectionless-Mode Network Service.” Standard. Geneva, CH: International Organization for Standardization, 2002. http://standards.iso.org/ittf/PubliclyAvailableStandards/.
Jacquet, Philippe. Optimized Link State Routing Protocol (OLSR). Request for Comments 3626. RFC Editor, 2003. doi:10.17487/rfc3626.
Katz, Dave. “OSPF and IS-IS: A Comparative Anatomy.” Presented at the NANOG19, Albuquerque, NM, June 12, 2000. https://nanog.org/meetings/abstract?id=1084.
Moy, John. OSPF Version 2. Request for Comments 2328. RFC Editor, April 1998. doi:10.17487/RFC2328.
Nguyen, Dang-Quan, Thomas H. Clausen, Philippe Jacquet, and Emmanuel Baccelli.
OSPF Multipoint Relay (MPR) Extension for Ad Hoc Networks. Request for Comments 5449. RFC Editor, 2009. doi:10.17487/rfc5449.
Ogier, Richard G. Use of OSPF-MDR in Single-Hop Broadcast Networks. Request for Comments 7038. RFC Editor, 2013. doi:10.17487/rfc7038.
Ogier, Richard, and Phil Spagnolo. Mobile Ad Hoc Network (MANET) Extension of OSPF Using Connected Dominating Set (CDS) Flooding. Request for Comments 5614. RFC Editor, 2009. doi:10.17487/rfc5614.
Pelsser, Cristel, Randy Bush, Keyur Patel, Pradosh Mohapatra, and Olaf Maennel. Making Route Flap Damping Usable. Request for Comments 7196. RFC Editor, 2014. doi:10.17487/RFC7196.
Przygienda, Tony, John Drake, and Alia Atlas. “RIFT: Routing in Fat Trees.” Internet-Draft. Internet Engineering Task Force, January 2017. https://datatracker.ietf.org/doc/html/draft-przygienda-rift-01.
Rekhter, Yakov, Susan Hares, and Tony Li. A Border Gateway Protocol 4 (BGP-4).
Request for Comments 4271. RFC Editor, 2006. doi:10.17487/rfc4271.
Retana, Alvaro, and Stan Ratliff. Use of the OSPF-MANET Interface in Single-Hop Broadcast Networks. Request for Comments 7137. RFC Editor, 2014. doi:10.17487/rfc7137.
Roy, Abhay, Yi Yang, and Alvaro Retana. Hiding Transit-Only Networks in OSPF. Request for Comments 6860. RFC Editor, 2013. doi:10.17487/rfc6860.
Shen, Naiming, Les Ginsberg, and Sanjay Thyamagundalu. “IS-IS Routing for Spine-Leaf Topology.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-shen-isis-spine-leaf-ext-03.
Teixeira, Renata, et al., “Impact of Hot-Potato Routing Changes in IP Networks,” IEEE/ACM Transactions on Networking 16, no. 6 (December 2008): 1295–307, doi:10.1109/TNET.2008.919333.
Wang, Lili, Zhaohui (Jeffrey) Zhang, and Nischal Sheth. OSPF Hybrid Broadcast and Point-to-Multipoint Interface Type. Request for Comments 6845. RFC Editor, 2013. doi:10.17487/rfc6845.
White, Russ. Intermediate System to Intermediate System (IS-IS) Routing Protocol Live-Lessons. Video. LiveLessons. Cisco Press, 2016. http://www.ciscopress.com/store/intermediate-system-to-intermediate-system-is-is-routing-9780134465326?link=text&cmpid=2017_02_02_CP_RussWhiteVideo.
White, Russ, and Shawn Zandi. “IS-IS Support for Openfabric.” Internet Draft. Internet Engineering Task Force, October 2017. https://datatracker.ietf.org/doc/html/draft-white-openfabric-03.
White, Russ, Danny McPherson, and Srihari Sangli. Practical BGP. Boston, MA: Addison-Wesley Professional, 2004.
1. Read the OpenFabric documentation provided in the “Further Reading” section. Does OpenFabric concentrate on aggregation or summarization? How does OpenFabric reduce the amount of control plane information without dividing up the network into flooding domains?
2. Read the Routing in Fat Trees (RIFT) documentation provided in the “Further Reading” section. Does RIFT concentrate on aggregation or summarization? Describe one technique that RIFT uses to summarize state and how RIFT handles aggregation.
3. When might it be useful to be able to configure overlapping flooding domains in IS-IS?
4. “Stub” in OSPF means what kinds of routing information will always be blocked at an ABR?
5. “Totally” in OSPF means what kinds of routing information will always be blocked at an ABR?
6. “Not so” in OSPF means what kinds of routing information will always be blocked at an ABR?
7. Read RFC 7196, Making Route Flap Damping Usable. What problems does this document describe with exponential backoff schemes, and what solutions does it propose to resolve these problems?
1. Hares, “Deprecate Atomic Aggregate.”
2. Teixeira, et al., “Impact of Hot-Potato Routing Changes in IP Networks.”
Understanding the design and operation of the transport and control plane subsystems of a network is a good start toward being a network engineer. Design involves adding several more points and integrating them into a whole. For instance, network design also involves the following tasks:
• Building security into the design and operation of a network
• Using the network as a tool (where it makes sense) to help secure the attached hosts and applications
• The design patterns used in network design, and where and how to apply those patterns
• Resilience at a system-wide level
• Choosing between the many technologies available to solve the set of problems presented by applications and business drivers (this reaches into the world of the network designer)
• How network design interacts with strategic business interests in the long term, including how the network impacts the directions the company may be able to take in the future (this reaches into the realm of the network architect)
Network design and architecture are very broad fields, far outside the scope of this book. In fact, very little has been written specifically on these larger fields—unlike the other introductory sections in this book. Be sure to consult the “Further Reading” section at the end of each chapter in this part, where the key works in these larger fields are called out.
The field of design relies heavily on models and abstractions; design tends to be more of a “seat of the pants” affair, grounded in experience and a broad knowledge set. Although they have been covered in other areas of this book, it is important to keep three specific models in mind when reading these chapters on design.
Many abstractions are meant to be “perfect,” in that they completely contain information within a single system. For instance, the Transmission Control Protocol (TCP) is designed to provide what appears to be a connection between two hosts (or two applications) across a network that does not guarantee packet delivery—in order, or at all. In reality, there are many situations where the operation of the Internet Protocol (IP)—which underlies TCP, and the physical links that underlie IP—will be directly visible in the operation of TCP. The law of leaky abstractions applies to almost every type of abstraction undertaken in a network, from layering protocols, to aggregating reachability and topology information, to building an overlay network “over the top.” A lot of complexity is driven into network protocols and design through various attempts to account for state leaking outside what should be a fairly watertight abstraction.
This was originally explained in Chapter 1, “Fundamental Concepts,” and is referenced throughout the rest of the book. Much of the art of design is consciously considering this set of tradeoffs at a system level; many designs have become overly complex, and hence overly fragile, because designers tend to focus on goals rather than tradeoffs. Consider this in terms of the decision to deploy an overlay (or even what kind of overlay to deploy). Deploying an overlay certainly decreases the amount of state, and the speed at which state changes, in the resulting underlay and overlay. The overlay can inject additional state at the overlay layer to use resources more efficiently (as an example, see Chapter 25, “Disaggregation, Hyperconvergence, and the Changing Network,” on network function virtualization). But introducing a second control plane and an “over the top” transport layer also creates a broad, and often deep, interaction surface. Will deploying the overlay ultimately increase overall complexity, or reduce it? What can be done to mitigate this additional complexity? Where will the underlay, as an abstraction, leak? What steps might need to be taken to stem this leak, and how much more complexity will this add?
Every design discussion, every design decision, needs to be driven by asking questions like these about tradeoffs.
The CAP theorem is widely known and appreciated in the database design field, but is not often considered in the world of network design. In reality, CAP tells the designer that there is a time cost to distance and processing. The more distance and processing separating a data source from the ultimate data destination, the more time it will take for the data to get there. When decisions are dependent on the presence of data, this means that distance and processing requirements will ultimately slow down the pace at which decisions can be made. Hence, the ideal situation is where decisions are distributed to the point closest to where the data required to make the decision “lives”—this is known as the subsidiarity principle. The key point to remember is the source of the data; the source of business policy is the business, so decisions about policy need to be close to the business. On the other hand, the source of routing information to find loop-free paths is the network devices that have near-real-time access to the state of topology and reachability in the network, so it makes sense to put decisions based on topology changes close to the network devices that actually forward traffic.
The chapters in Part III assume all of these factors need to be considered in each design realm; knowing and applying them will speed your capabilities as a network designer. The chapters in this part include:
• Chapter 21: Security: A Broader Sweep, with discussions of the different components of the security environment, defense in depth, information privacy, and the OODA loop
• Chapter 22: Network Design Patterns, with discussions of the relationship between business and network design, network ownership models, choke points, hierarchical design, layering, common network topologies, and regular topologies
• Chapter 23: Redundant and Resilient, with discussions of control plane failures, control plane convergence, measuring network availability, graceful restart, in service software upgrades, and modularization for resilience
• Chapter 24: Troubleshooting, with discussions of the narrowing process, breaking networks into components, the how model, the what model, troubleshooting tools, models in troubleshooting, the half split method, and technical debt
Security is often placed last in any discussion of network design principles; it is often thought of as an add-on to the main focus of the design process. The modern world, however, is a dangerous place for data, particularly data that can ruin people’s lives permanently.
Security is a very broad and important topic for network engineers; the following sections will outline why this is so.
Consider this simple example: many electronic devices have a fingerprint reader that is (often) used in the place of a password. Gaining access to such devices is much simpler than password or pin-driven access; there is no password to remember. It is also (theoretically) more secure. You cannot “steal” someone’s fingerprint.
Or can you? There are two less than obvious lines of attack. First, you leave your fingerprint everywhere in everyday life. It is on the screen of your cell phone, the doorknob of any building you enter, the door handles of your car (or the handlebars of your scooter or bike), and in many other places. People have long been able to lift such prints from a wide array of surfaces. How much of a secret is your fingerprint, really? This same problem applies to any externally visible body characteristic used to identify you: cameras are everywhere, and at least some of them capture just about any part of your body used for identification on a regular basis. Figure 21-1 illustrates this problem.
The second line of attack is, perhaps, less obvious but maybe more dangerous. No system stores fingerprints as images, per se. Rather, all fingerprint systems store fingerprints as a digitized version of the key characteristics of each fingerprint. Certainly such files will be encrypted and protected, and perhaps even just stored locally. Once fingerprint data is taken online, however, all security bets are off. No matter how well protected, data moved across the public Internet has some percentage probability of being stolen at some point. Data breaches are common; for instance, here are a few sample breaches from 2016:
• FACC, a manufacturer of lightweight composites, was the victim of cyber theft worth at least $54.5 million.1
• The University of Florida, exposing the records of about 63,000 students and staff.2
• The FBI, exposing the contact information of about 20,000 employees.3
• The United States Internal Revenue Service, exposing the information of about 220,000 tax payers.4
• The University of California at Berkeley, exposing information of about 80,000 students, faculty, and alumni.5
• Premier Health Care, exposing information about 200,000 patients.6
• Verizon Enterprise Services, potentially exposing information about 1.5 million customers.7
• The City of Salt Lake City, Utah, exposing information about 14,200 people.8
• Tidewater Community College, exposing information about 3,000 employees.9
• The voting system of the Philippines, exposing information about 55 million citizens.10
• Yahoo, exposing the information of between 500 million and 1 billion users.11
There are hundreds (or thousands) of such breaches each year, many of which escape the notice of the wider public or are not reported at all. Once fingerprints are stored like credit card and other information, it is a matter of time before large caches of fingerprint information are stolen. Of course, not every fingerprint for every person in the world will be stolen in such a breach, but this will be small comfort to those whose fingerprints were stolen.
To understand the world of security, it is important to understand some basic terms. Figure 21-2 illustrates the first set of important definitions.
In Figure 21-2, working from left to right:
• The threat actor, or attacker, is the individual or organization initiating the attack(s). The identity of the threat actor can help you understand motivations, skill level, and possible plan of attack.
• The exploit takes advantage of the vulnerability using a process or tool, such as manually entering a specific string, running a script or piece of software, etc.
• The attack or threat is the potential or actual attack performed by the threat actor using the exploit.
• Vulnerabilities are potential weak spots in the defense that can be exploited to achieve an objective. For instance, a missing or misconfigured packet filter, someone inside the network with rights to the target system who can be social-engineered to provide access in some way, or a defect in a system’s code allowing a threat actor to attack a system in some way.
• The attack surface is the total set of systems, port numbers, applications, etc., the threat actor has access to in some way.
• Assets are the network elements and information within the network that the threat actor would either like to access or prevent access to (in the case of a denial of service attack).
• Risks are the potential negative results of an attack, such as bad publicity (represented by the microphone), a major business consequence (represented by the breach report), or even a complete failure of the business itself. Risk is often quantified as probability times impact.
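To make the quantification concrete, here is a toy sketch of risk as likelihood times impact; the figures are entirely hypothetical:

```python
# Toy risk quantification: expected loss is the likelihood of an attack
# succeeding in a period times the impact if it does. All figures here
# are hypothetical, for illustration only.
def risk_score(probability, impact_dollars):
    return probability * impact_dollars

rare_breach = risk_score(0.05, 2_000_000)  # 5% chance, $2M impact
common_leak = risk_score(0.50, 200_000)    # 50% chance, $200k impact
```

Both work out to the same expected loss ($100,000 per period), illustrating why likelihood and impact must be weighed together rather than either dimension alone.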
The security problem space, from a network perspective, can be divided into three broad areas:
• How can users and processes access the data they need to do their jobs?
• How can the information carried over the network and stored on devices connected to the network remain confidential?
• How can the network remain accessible? Many attackers would like to disrupt a business or organization by removing the network as a usable resource; this is called a denial of service (DoS) attack.
The following section will consider a wide scope of possible solutions to each of these problems. After this, some examples of solutions in the security space will be considered, and then a useful model for considering network security, the Observe, Orient, Decide, Act (OODA) loop, will be discussed.
Many books, articles, and research papers have been written addressing different elements of security since the first network break-in, which probably happened the day after the first network was operational. The “Further Reading” section will be helpful if you are interested in learning more about security than what is contained in this and later sections in this chapter.
This section will begin by looking at the concept of defense in depth and then consider three broad security solution spaces: access control, data protection, and service availability assurance.
The first, and most important, solution is more conceptual than tool or method oriented. Defense in depth is the concept of having multiple, overlapping layers of defenses interacting in a way that poses multiple challenges to an attacker. Figure 21-3 illustrates one possible set of lines of defense.
In Figure 21-3, there are a number of lines of defense, including
• Packet filters and access controls at the routed edge to the network; these are “basic” controls that just check to see if a user is authorized to access the network in general, block some basic (obvious) packet flows, and even limit the rate at which hosts outside the network can transfer data.
• Route validation at the routed edge to the network; this will help prevent access from hijacked address space.
• General telemetry throughout the network, which will indicate top talkers, provide information on the most common source/destination pairs, note unusual spikes in utilization, etc.
• Stateful packet filters at the middlebox C, which will only allow traffic into the network on some ports if there is an existing connection.
• Exfiltration monitoring deployed in several locations; this will raise an alert when specific access patterns occur, such as large amounts of data being transferred toward a destination outside the network from a database containing sensitive user information.
• Access control on individual services and/or servers, such as G, which ensures the user at D is authorized to access individual resources.
Although none of these systems will stop an intruder from breaching your network or data, used together they can provide a fairly effective defense system against many different forms of attack. First, the time a threat actor spends moving through one layer gives the network operator an opportunity to discover and counter the attack with more specific controls. Second, the layering of systems and filters forms a kind of “layer of grids” through which traffic must pass to reach important resources. An attack not blocked by the first grid layer may be blocked by the second, etc.
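The "layer of grids" effect can be sketched as a chain of independent checks; the four checks below are hypothetical stand-ins for the defenses listed above, not real filter implementations:

```python
# Hypothetical defense-in-depth sketch: a request must pass every grid
# layer, and any layer may block it independently of the others.
def edge_filter(req):     return req.get("src_port") != 0          # basic edge check
def stateful_filter(req): return req.get("established", False)     # existing connection?
def exfil_monitor(req):   return req.get("bytes_out", 0) < 10_000_000
def service_acl(req):     return req.get("user") in {"alice", "bob"}

LAYERS = [edge_filter, stateful_filter, exfil_monitor, service_acl]

def admit(request):
    """Return (allowed, name_of_layer_that_blocked)."""
    for layer in LAYERS:
        if not layer(request):
            # Time an attacker spends defeating one layer gives the
            # operator a chance to detect and counter the attack.
            return False, layer.__name__
    return True, None
```

A request that passes the edge filter may still be dropped by the stateful filter or the service ACL, which is the point of overlapping defenses: no single layer has to be perfect.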
Understanding the in-depth defensive posture of the complete set of systems in a network and using every available resource as a potential defensive system are important skills in the network design space.
Access control tries to ensure
• Users (or processes) are who they claim to be—to verify identity. This is often called authentication.
• Users (or processes) are able to access the data they are attempting to access, or rather whether or not they are authorized to use a particular service or access a particular piece of data. This is often called authorization.
• Information about user actions is recorded so it can be used to trace back to failures and breaches. This is generally called accounting.
Access control systems are often called AAA systems, because of the three A’s: authentication, authorization, and accounting. These systems are often specialized applications, interacting with devices through a protocol such as Remote Authentication Dial-In User Service (RADIUS) to ensure users are properly logged in and applications have enough information to ensure that only valid and authorized users are accessing information and services.
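The three A's can be sketched as follows; the user table, SHA-256 hashing, and resource names are hypothetical stand-ins for illustration, not a real RADIUS or AAA server:

```python
# Toy AAA sketch: authentication, authorization, and accounting.
import hashlib

USERS = {"alice": hashlib.sha256(b"s3cret").hexdigest()}  # toy credential store
PERMISSIONS = {"alice": {"payroll-db"}}                   # toy authorization policy
AUDIT_LOG = []                                            # accounting records

def authenticate(user, password):
    """Authentication: verify the user is who they claim to be."""
    return USERS.get(user) == hashlib.sha256(password.encode()).hexdigest()

def authorize(user, resource):
    """Authorization: check whether this user may access this resource."""
    return resource in PERMISSIONS.get(user, set())

def access(user, password, resource):
    ok = authenticate(user, password) and authorize(user, resource)
    # Accounting: record the attempt either way, for later tracing.
    AUDIT_LOG.append((user, resource, "granted" if ok else "denied"))
    return ok
```

In a real deployment these checks would live in a AAA server reached over a protocol such as RADIUS, with each device or application delegating the decision rather than holding the user database itself.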
Access control can be implemented in many different places in a system; for instance:
• Before the user connects to the network, or rather before the user can log on to a device that is able to obtain an Internet Protocol (IP) address, connect to an upstream switch, connect to a wireless network, etc.
• After the user connects to the network but before the user can access any service.
• After the user connects to the network and before the user accesses each individual service.
These three different options can be combined; for instance, a user may be asked to provide a username and password before connecting to the network, and again before accessing any service on the network, and then again before accessing particular, more highly restricted, systems or information.
Consider the protection around a safe holding classified documents. Is it possible the safe might be breached in some way? What if someone calls in a bomb threat to the building where the safe is housed? Will the occupants of the building gather up any necessary equipment and information, including the contents of the safe, and leave the building? Or what if someone who appears to be authorized calls and asks for the data? The safe being breached is similar to an access control failure, and the bomb threat or request for information that drives the data out into the street is similar to a request pulling the information across the network.
Ultimately, there must be a line of defense to protect data in these sorts of situations; systems and applications will be breached, and data will be requested over the network. Encryption is generally the last line of defense for these situations.
Encryption takes a block of information (the plaintext) and encodes it using some form of mathematical operation to obscure the text, resulting in a ciphertext. To recover the original plaintext, the mathematical operations must be reversed. Most encryption is based on the difficulty involved in factoring a large integer composed of two or more prime factors. An integer is calculated based on the key (a prime factor) and some portion of the plaintext, resulting in the ciphertext. To recover the plaintext from the ciphertext, the process is reversed; the key is used to find the factor of the large integers in the ciphertext, ultimately calculating the original plaintext.
There are two kinds of widely used encryption: public key and private key. In public key cryptography, more properly called asymmetric cryptography, there are two factors or keys; if the plaintext is encrypted using one of the keys, it can be unencrypted using the second key. This is useful because it allows one of the two keys to be published publicly. In private key cryptography, more properly called symmetric key cryptography, the same key is used to encrypt and unencrypt the plaintext; hence the sender and receiver must share the same key to communicate.
Public and private key cryptography are often used together to form a single system. Figure 21-5 is used to illustrate.
In Figure 21-5:
1. Assume A begins the process. A encrypts a nonce (a large random number) using B’s public key. Because the nonce has been encrypted with B’s public key, in theory only B can decrypt it, as only B should know B’s private key.
2. B, on decrypting the nonce, now sends some new nonce to A. This may include A’s original nonce, or A’s original nonce plus some other information. The point is that A must know, for certain, that the original message, including A’s nonce, was received by B, and not by some other system acting as B. This is ensured by B including some piece of information encrypted using its private key; because only B could have produced it, A can verify it using B’s public key.
3. A and B, using the nonces and other information exchanged to this point, calculate a shared symmetric key, which is then used to encrypt and decrypt information transferred between the two systems.
The steps outlined here are somewhat naive; there are better, more secure systems, such as the Internet Key Exchange (IKE) protocol; see the “Further Reading” section for resources in this area. Why not just use asymmetric (or public key) cryptography all the time? Because the computational cost of asymmetric key cryptography is much higher than that of symmetric cryptography.
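The overall flow of the three steps can be sketched in Python. This is a toy illustration only: hashing stands in for the asymmetric operations (the standard library has no RSA), so it shows the shape of the exchange, not a secure key agreement:

```python
import hashlib
import secrets

# Toy sketch of the Figure 21-5 exchange. In a real system, nonce_a would
# travel encrypted under B's public key, and B's reply would be signed with
# B's private key; here we only model the data flow.
nonce_a = secrets.token_bytes(16)   # step 1: A's nonce, sent to B
nonce_b = secrets.token_bytes(16)   # step 2: B's reply nonce, proving B saw step 1

# Step 3: both sides independently derive the same symmetric key from the
# nonces they now share; all further traffic uses this cheaper symmetric key.
key_at_a = hashlib.sha256(nonce_a + nonce_b).digest()
key_at_b = hashlib.sha256(nonce_a + nonce_b).digest()
assert key_at_a == key_at_b
```

The design point this illustrates: the expensive asymmetric operations are used only long enough to agree on a symmetric key, and the symmetric key carries the bulk of the session.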
A second area to be concerned about in data protection is data exhaust. There are many other terms for this, of course, but the general idea is vulnerabilities in the communication patterns. For instance, assume a bank configures an automated backup for a particular database table; when the balances in the accounts held in the table change by a particular amount, the backup is kicked off automatically. This might seem like a perfectly reasonable sort of backup job, but it does involve some amount of data exhaust. If a threat actor puts the backup together with the change in account values, he will know specifically what the pattern of account activity is. Enough clues of this sort can be developed into an entire set of attack plans.
How can network engineers protect against data exhaust? There is no really good way to prevent leaking information unintentionally into the public domain through such actions; even in security, the law of leaky abstractions applies. The best you can do is to be aware of such problems, potentially profiling your network the same way an attacker would, and noting any patterns that might be used against your defense system.
Distributed denial of service (DDoS) attacks are on the rise, with the largest reaching over 1 terabit per second in late 2016, using hijacked Internet of Things (IoT) devices, called a botnet. Figure 21-6 illustrates one way such an attack can be built.
The process in Figure 21-6 begins before the attack, with the creation of a botnet to use as a platform. Building botnets is mostly a matter of getting as many devices as possible infected with a virus, allowing a controller to instruct the device to send a stream of packets to some IP address on demand. Viruses are designed to infect a wide range of devices, including IoT devices (like light bulbs, refrigerators, video cameras, television sets, etc.), personal computers, cell phones, and web servers (which normally run inside a virtual machine), ultimately allowing them to be controlled in some limited sense by the botnet controller. Such botnets can be rented by the hour fairly easily.
1. The botnet controller sends a command to each of the devices, potentially hundreds of thousands of them, to send a series of packets to a set of well-known servers. Any sort of server hosting a widely used public service with a lot of bandwidth and processing power will do; favorites are Domain Name System (DNS) and Network Time Protocol (NTP) servers.
2. The botnet devices send requests for some piece of information to each server being used as a reflector in the attack. Typically, this is a request for a DNS resolution, or for a large text record stored in the DNS table, or something similar. The source address of the request is forged, set to the target’s IP address.
3. The servers respond to the request with large amounts of data, which are then sent to the target device. Some resource on the target, such as available bandwidth, available Transmission Control Protocol (TCP) connection buffers, or something else with a limited scale, is consumed, preventing the target from operating properly (for instance, preventing a web server from serving web pages to visitors).
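The attraction of reflection for the attacker is the amplification involved: small forged requests produce large responses aimed at the target. A rough back-of-the-envelope sketch in Python, with every number purely illustrative:

```python
# Rough reflection/amplification arithmetic; all figures are illustrative,
# not measurements of any real attack.
bots = 100_000                # infected devices in the botnet
requests_per_second = 10      # forged queries each bot sends per second
request_bytes = 60            # size of each small spoofed query
amplification = 50            # the reflector's response is ~50x the query

# Traffic arriving at the target, in bits per second.
attack_bps = bots * requests_per_second * request_bytes * amplification * 8
print(f"traffic at the target: {attack_bps / 1e9:.1f} Gbit/s")
```

The point of the sketch: the botnet itself only transmits the small queries; the reflectors do the heavy lifting, multiplying the attacker's outlay many times over.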
Why do threat actors build and launch these kinds of attacks? There are a number of reasons, including
• To extort money from businesses. A large-scale attack against a well-known target is particularly effective for extortion; a threat actor can send an email to hundreds of companies saying something like: “Did you see the news about the big DDoS attack? That was me. If you do not pay me (some large amount of money), you will be next.” If the targeted company perceives it has fewer resources and skills than the well-known target, it may pay rather than trying to defend itself against such a large attack.
• To make a political point. Some attacks are targeted at organizations that the threat actor disagrees with politically, such as a company failing to support a specific cause, a rival political party, etc.
• To bring down a competitor. Some threat actors sell the service of taking down a rival’s website for some period of time, in order to embarrass the company or drive users to a competitor.
• To distract the security team at a company while some other attack is occurring. DDoS attacks are often a useful feint to distract the corporate security team while some form of back door or other vulnerability is created in the victim’s network.
• Because they can. Some people just seem to enjoy wreaking havoc, or they believe it is the only way they will ever become famous.
There are a number of ways to defend systems against DDoS attacks, many of which can (and should) be used in parallel.
Some modifications can be made to host operating systems that will allow the server to withstand the traffic of a DDoS attack while continuing to provide service (though perhaps more slowly). These modifications primarily relate to making more resources available, closing incomplete connection requests more quickly, aging out currently unused cached information more aggressively, and other measures.
Server protocol implementations, and even (to some degree) edge routers, can block half-open and malformed sessions. A normal Transmission Control Protocol (TCP) session setup has multiple steps:
1. The client requests a connection by sending a synchronize (SYN) packet to the server.
2. The server replies with an acknowledgment of the connection request (SYN-ACK).
3. The client acknowledges receipt of the SYN-ACK with an ACK; the three-way handshake is complete, and data can be transmitted over the session.
In some TCP DDoS attacks, the client will send the SYN but never acknowledge the SYN-ACK. This is called a half-open session. Half-open sessions represent consumed resources on the server while costing the attacker very little. Many routers and stateful packet inspection devices can drop half-open TCP sessions.
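Dropping half-open sessions amounts to keeping a small state table and reaping entries that never complete the handshake. The Python sketch below is an illustrative model only; the function names, keying, and the five-second timeout are invented for the example and do not reflect any particular vendor's implementation:

```python
import time

HALF_OPEN_TIMEOUT = 5.0   # seconds to wait for the final ACK (illustrative)

half_open = {}            # (client_addr, client_port) -> time the SYN arrived

def on_syn(client, port, now=None):
    """A SYN arrived: remember when this handshake started."""
    half_open[(client, port)] = now if now is not None else time.monotonic()

def on_ack(client, port):
    """The final ACK arrived: the handshake completed, forget the state."""
    half_open.pop((client, port), None)

def reap(now=None):
    """Drop and return sessions that stayed half-open past the timeout."""
    now = now if now is not None else time.monotonic()
    stale = [k for k, t in half_open.items() if now - t > HALF_OPEN_TIMEOUT]
    for key in stale:
        del half_open[key]
    return stale
```

A device running logic like this periodically calls `reap()` and frees the resources (or sends resets) for each stale entry, capping what a SYN flood can consume.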
Another option in the case of TCP-based DDoS attacks is for the server to push processing work back onto the systems used in the attack. One way to do this is to allow the server to respond to TCP SYN messages with a malformed SYN-ACK. If the client is running a well-designed, unmodified TCP implementation, this will cause the system used in the attack to spend processing and memory resources reporting the error back to the server. This additional load will reduce the amount of bandwidth and processing power the botnet has available to pursue the attack.
Few of these defenses work against attacks based on sessionless transport protocols, such as the User Datagram Protocol (UDP).
Many operating systems offer the ability to limit the number of incoming connection requests over a specific time scale (usually something like x hundred/thousand connection requests per second). Some network devices extend this concept to control plane protection, which limits the rate at which information is transmitted from the data plane into the control plane for processing. These schemes do save resources but often at a cost: both good and bad traffic are dropped. Schemes may be applied to drop just bad traffic, but defining bad traffic is difficult. There is no evil bit in the IP packet.
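Connection-rate limiting of this sort is commonly implemented as a token bucket. A minimal Python sketch follows; note that the tradeoff described above applies here too, since the bucket cannot tell good requests from bad and simply drops whatever exceeds the rate:

```python
class TokenBucket:
    """Cap incoming connection requests at `rate` per second, allowing
    short bursts up to `burst`. Parameters are illustrative."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens (requests) added per second
        self.burst = burst      # maximum bucket depth
        self.tokens = burst     # start full
        self.last = 0.0         # timestamp of the last check

    def allow(self, now):
        # Refill according to elapsed time, then spend one token if possible.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True         # admit the connection request
        return False            # over the limit; drop it, good or bad
```

Control plane protection on a router follows the same pattern, with the "requests" being packets punted from the data plane to the control plane.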
For operators with a very large (or dispersed) edge, it is possible to use routing controls to spread the DDoS traffic among as many entry points into the network as possible. For instance, a 1Tb/s attack, if spread across 1,000 servers/network edge entry points, becomes a 1Gb/s data stream at each server/entry point, which is far easier to absorb. Figure 21-7 is used to illustrate.
In Figure 21-7, AS65000 has six entry points, each feeding a separate server (or set of servers).
Assume the attacker has IoT devices scattered throughout AS65002 that are being used to launch an attack. Due to policies within AS65002, the DDoS attack streams are forwarded into AS65001, and thence to A and B. It would be easy to shut down these two links, forcing the traffic to disperse across five entry points rather than two (B, C, D, E, and F). If you split the traffic among five entry points, it may be possible to absorb it; each flow is now less than one-half the size of the original DDoS attack, perhaps within the capacity of the servers at these entry points to discard the DDoS traffic.
However, this kind of response plays into the attacker’s hand, as well. Now any customer directly attached to AS65001, such as G, will need to pass through AS65002, from whence the attacker has launched the DDoS, and enter into the same five entry points. How happy do you think the customer at G would be in this situation? The most likely answer is not very.
Is there another option? Instead of shutting down these two links, it would make more sense to try to reduce the volume of traffic coming through the links and leave them up. To put it more directly, if the DDoS attack is reducing the total amount of available bandwidth you have at the edge of your network, it does not make a lot of sense to reduce the available amount of bandwidth at your edge in response. What you want to do, instead, is reapportion the traffic coming into each edge so you have a better chance of allowing the existing servers to discard the DDoS attack.
One possible solution is to prepend the Autonomous System (AS) path of the anycast address being advertised from one of the service instances. Here, you could add one prepend to the route advertisement from C and check to see if the attack traffic is spread more evenly across the three sites. However, this is not always an effective solution. Further, if this is an anycast service, the address space cannot be broken up into smaller bits. So what else can be done?
There is a way to do this with the Border Gateway Protocol (BGP): using communities to restrict the scope of the routes being advertised by A and B. For instance, you could begin by advertising the routes to the destinations under attack toward AS65001 with the NO_PEER community. Given that AS65002 is a transit AS (assume it is for this exercise), AS65001 would accept the routes from A and B but would not advertise them toward AS65002. This means G would still be able to reach the destinations behind A and B through AS65001, but the attack traffic would still be dispersed across five entry points, rather than two. There are other mechanisms you could use here; specifically, some providers allow you to set a community telling them not to advertise a route toward a specific AS, whether the AS is a peer or a customer. You should consult with your provider about this, as every provider uses a different set of communities, formatted in slightly different ways; your provider will probably point you to a web page explaining its formatting.
If NO_PEER does not work, it is possible to use NO_ADVERTISE, which blocks the advertisement of the destinations under attack to any of AS65001’s connections of whatever kind. G may well still be able to use the connections to A and B from AS65001 if it is using a default route to reach the Internet at large.
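The semantics of these two well-known communities can be modeled in a few lines of Python. This sketch illustrates only the advertisement decision, not BGP itself; the `neighbor_type` classification is an assumption made for the example:

```python
# Well-known BGP community values, as ASN:value strings.
NO_PEER = "65535:65284"        # RFC 3765: do not advertise to bilateral peers
NO_ADVERTISE = "65535:65282"   # RFC 1997: do not advertise to any neighbor

def should_advertise(route_communities, neighbor_type):
    """Decide whether a route is advertised to a neighbor.
    neighbor_type is 'customer', 'peer', or 'provider' (illustrative)."""
    if NO_ADVERTISE in route_communities:
        return False                       # suppressed toward everyone
    if NO_PEER in route_communities and neighbor_type == "peer":
        return False                       # suppressed toward peers only
    return True

# In the scenario above: AS65001 keeps advertising the route to its
# customer G, but not toward its peer AS65002.
assert should_advertise({NO_PEER}, "customer")
assert not should_advertise({NO_PEER}, "peer")
```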
It is possible to automate this reaction through a set of scripts, but as always, it is important to keep a short leash on such scripts. Humans need to be alerted to make the decision to use these communities, or to continue using them; it is too easy for a false positive to lead to a real problem.
Since (at least some) attack traffic originates from unused and/or unroutable address space (known as bogon routes), filtering these routes can be useful in blocking some amount of DDoS attack traffic.
Assume A is infected with a virus, making it part of a botnet; at some point, the host is going to be configured to send some stream of packets to a public server, which will then be reflected to a target machine. The botnet could instruct the host to use its actual address, but this will not work for some forms of attack. For instance, a DNS server will respond to the source address in the packet containing the DNS request.
The preferred method for an attacker is this: instruct A to use a spoofed, or hijacked, address. For instance, the botnet controller may instruct A to use an address in 2001:db8:3e8:100::/64 address space because C is the attack’s target.
There is a somewhat simple way for B to block this spoofed traffic. When switching traffic, B can look up the route to the source address of the packet being switched. If the source address is
• Not reachable, the packet should be dropped; this is loose uRPF.
• Reachable only through some interface other than the one on which the packet was received, the packet should be dropped; this is strict uRPF.
If B is configured with strict uRPF, at least on the ports to which customers are connected (such as the port that A is connected to), then traffic sourced from A with a spoofed source address, such as the target’s, would be dropped.
If uRPF can prevent many forms of reflection DDoS attacks, why is it not configured on every port? Strict uRPF does not work in all situations; there are many legitimate reasons why a packet may not be entering the same interface the router would use to reach the source address. The primary reason for this is dual-homing situations, where the provider installs just one route to the destination, but packets are transmitted along both routes by the actual hosts. It is also difficult to implement uRPF in a way that does not impact the performance of very high-speed links.
One of the problems with a large-scale DDoS attack is that your entire upstream link can be consumed in the attack. One solution is to signal your upstream provider to block the DDoS flows. Flowspec can be used to carry packet-level filter rules in BGP. The general idea is this: you send a set of specially formatted communities to your provider, which then automagically uses those communities to create filters at the inbound side of your link to the Internet. There are two parts to the flowspec encoding, as outlined in RFC 5575bis: the match rule and the action rule. The match rule is encoded as shown in Figure 21-9.
There is a wide range of conditions you can match on. The source and destination addresses are pretty straightforward. For the IP protocol and port numbers, the operator sub-TLVs allow you to specify a set of conditions to match on, and whether to AND the conditions (all conditions must match) or OR them (any condition in the list may match). Ranges of ports, greater than, less than, greater than or equal to, less than or equal to, and equal to are all supported. Fragments, TCP header flags, and a number of other header fields can be matched on, as well.
Once the traffic is matched, what do you do with it? There are a number of possible actions, including
• Control the traffic rate in either bytes per second or packets per second
• Redirect the traffic to a virtual routing and forwarding (VRF) instance
• Mark the traffic with a particular DSCP bit
• Filter the traffic
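Reduced to its essence, a flowspec rule is a match plus an action. The minimal Python model below illustrates that pairing; the field names and the sample rule are invented for the example and are not the RFC 5575bis wire encoding:

```python
# A rule is a dict of match fields plus an action tuple (illustrative model).
rules = [
    {"match": {"proto": "udp", "dst_port": 53, "frag": True},
     "action": ("rate-limit", 0)},        # a rate limit of 0 bytes/sec drops
]

def matches(rule, packet):
    """True if every field in the rule matches the packet's headers."""
    return all(packet.get(field) == value
               for field, value in rule["match"].items())

def apply_rules(packet):
    """Return the action of the first matching rule, else permit."""
    for rule in rules:
        if matches(rule, packet):
            return rule["action"]
    return ("permit", None)

# Fragmented UDP/53 traffic (a common reflection signature) is dropped;
# ordinary web traffic passes untouched.
assert apply_rules({"proto": "udp", "dst_port": 53, "frag": True}) == ("rate-limit", 0)
assert apply_rules({"proto": "tcp", "dst_port": 80, "frag": False}) == ("permit", None)
```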
If you think this must be complicated to encode, you are right. This is why most implementations allow you to set pretty simple rules and handle all the encoding bits for you. Given flowspec encoding, you should just be able to detect the attack, set some simple rules in BGP, send the right “stuff” to your provider, and watch the DDoS go away. If you have been in network engineering for longer than “I started yesterday,” you should know by now: nothing is ever this simple.
If you do not see a tradeoff, you have not looked hard enough.
First, from a provider’s perspective, flowspec is an entirely new attack surface. You cannot let your customer just send you whatever flowspec rules it likes. For instance, what if your customer sends you a flowspec rule blocking traffic to one of your DNS servers? Or, perhaps, to one of its competitors? Or even to its own BGP session? Most providers, to prevent these types of problems, will apply any flowspec-initiated rules to just the port connecting to your network directly. This protects the link between your network and the provider, but there is little way to prevent abuse if the provider allows these flowspec rules to be implemented deeper in its network.
Second, filtering costs money. This might not be obvious at a single link scale, but when you start considering how to filter multiple gigabits of traffic based on deep packet inspection sorts of rules—particularly given the ability to combine a number of rules in a single flowspec filter rule—filtering requires a lot of resources during the actual packet switching process. There is a limited number of such resources on any given packet processing engine (ASIC) and a lot of customers who are likely going to want to filter. Since filtering costs the provider money, it is most likely going to charge for flowspec, limit which customers can send it flowspec rules (generally grounded in the provider’s perception of the customer’s cluefulness), and even limit the number of flowspec rules that can be implemented at any given time.
A number of appliances will use local information, along with analytics gathered from a wide range of networks, to discover and block DDoS-specific flows. These appliances can be deployed inside your network, in front of or behind your edge router, as a security device. DDoS protection services can scrub your inbound traffic, as well; Figure 21-10 illustrates one way in which these services work.
There are five steps in Figure 21-10:
1. A host, A, requests the IP address for some domain, say example.com, from a DNS server.
2. The DNS server responds with an IP address pointing to the DDoS scrubber service, hosted in a content or service provider’s network.
3. The host sends its traffic to the scrubber service at B.
4. B removes any DDoS traffic, leaving just the goodput, and then tunnels the remaining traffic across the Internet to the original server, C.
5. The server responds to the request as normal, sending the information directly back to the requesting host.
Any device that is part of a botnet will also receive the scrubber’s address as the correct one to reach the service under attack. The scrubber service is normally positioned in a network able to consume many gigabits of traffic, remove any traffic that appears to be part of a DDoS attack, and send the remaining traffic on to the original server. Such scrubbing services go far beyond examining the traffic, using near-real-time information about active botnets, information from DNS queries, and other factors to determine which traffic is goodput and which is part of the DDoS attack.
The OODA loop was originally developed by Colonel John Boyd of the United States Air Force to help fighter pilots manage decisions quickly in life-or-death situations. While network security might seem to be far outside the realm of military aircraft, the OODA loop has proven useful in a number of different security-related (and more generally reaction-related) situations. The OODA loop consists of four steps:
• Observe
• Orient
• Decide
• Act
The four steps begin with the letters O, O, D, and A—hence the OODA loop. Figure 21-11 illustrates.
If you have ever heard the expression you need to get inside the loop, this comes from the OODA loop. The person who has the “tightest loop,” or who can move through the loop the fastest, will win the contest. In terms of network security, you must be able to get inside the threat actor’s loop to get ahead of him and to find ways to stop the attack in progress.
Each of the four steps deserves a closer look.
What should you observe, and where should you observe it? In some cases, this is the most important question to ask and the hardest to answer. Should you measure the average traffic flow across specific points in the network? The average jitter across specific points? The average delay? The number of routes in the routing table? The rate at which the routing table changes?
The right answer is to measure everything that will give you a good feel for the day-to-day operation of the network—with some caution. There may seem to be little harm in acquiring telemetry data from every device and every part of the network possible, and throwing the information away after a short period of time if it does not prove useful. The realistic answer, however, is that you must choose your observation points carefully, after much trial and error, and after much thought about traffic flows, failure modes, etc.
There is a second point hidden in observe, however: how do you know what you are observing unless you record it? As the old saying goes, “if you didn’t write it down, it didn’t happen”—and nothing is truer than this in the world of observation. There is no point in knowing what is happening right now unless you know what has happened in the past.
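Recording observations is what makes a baseline possible: the current sample only means something against the history you kept. A minimal Python sketch of the idea follows; the window size and the crude three-times-the-average threshold are both purely illustrative:

```python
from collections import deque

class Baseline:
    """Keep a rolling history of a metric and flag samples that deviate
    sharply from the recorded baseline. Parameters are illustrative."""

    def __init__(self, window=288):          # e.g., 24h of 5-minute samples
        self.history = deque(maxlen=window)

    def observe(self, value):
        """Record the sample; return True if it looks anomalous."""
        if len(self.history) >= 10:          # need some history first
            mean = sum(self.history) / len(self.history)
            anomalous = value > mean * 3     # crude: 3x the recorded average
        else:
            anomalous = False                # not enough history to judge
        self.history.append(value)
        return anomalous
```

Real telemetry systems use far richer statistics, but the shape is the same: no recorded past, no meaningful present.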
Once you’ve made a set of observations, you need to decide what it is you’re observing. Consider the simple optical illusion shown in Figure 21-12.
In Figure 21-12, there are two sets of squares, one of which is imposed on a background of geometric lines. On the right, the squares are actually square. On the left, however, the squares do not look square; they appear to be distorted. What you see is often determined by the context as much as by what is there. Observing, therefore, is a skill you can develop over time. Observing a network to understand its normal state is like any other observational skill. How can engineers develop observation skills?
First, understand the operation of the network, protocols, and applications at a theoretical level. Reaching beyond the command line and into the actual operation of the devices in the network (understanding how a router forwards packets, or how OSPF builds and processes packets) can make a huge difference in your ability to orient yourself to what you are observing.
Second, learning and applying models is a huge help. The reason you probably have trouble with the optical illusion in this figure is the boxes appear close to being squares, so you immediately think they must be squares. You have a “model of a square,” in your head, and when you see things close to the model, you try to make the object fit. So it is important to know a wide array of models into which any problem can fit—or, in the networking world, a wide array of models you can use to “see” protocol, application, device, and network operation. Each additional model you add to your “mental model set” allows you to orient yourself a bit faster.
This entire process is much like orienteering. First, get the map pointing north. Then, find the features on the map matching your location, and work from there to the destination, feature by feature. Not orienting the map is failing to separate the background from the information. Not being able to see the surrounding area is failing to collect the information necessary to match the map to the reality. Not knowing the symbols on the map is failing to have enough mental models to make the match between map and reality happen.
The worst time to make any sort of decision is at 2 a.m., in the middle of a network outage, when you are under pressure to get the business back up and running. But given network outages never happen at a convenient time, how can you avoid making these kinds of decisions?
You can decide what you are going to decide before you must decide.
This might sound a little roundabout; perhaps an example from the world of self-defense training classes would be helpful. When is the best time to decide where you are going to go if someone attacks you? Before he does so, or while he is doing so? Defensive driving is no different: it is always best to know where you would go if the car in front of you suddenly spins out, or the wheels fall off, or some other thing you might not expect to happen.
This pre-decision process can be very helpful in a network environment. For instance:
• Where would you put a filter to block this particular type of traffic?
• Which parallel links would you remove to kill off the positive feedback loop keeping your routing protocol from converging?
• What servers can you shut down for a time while you are trying to figure out why the data center fabric has become so hot all of a sudden?
All of these decisions are choices you can make before the action starts—before you have to decide to do something. In other words, decide what you need to do so that when it comes time to do it, you will have a plan in place.
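The three questions above are, in effect, entries in a playbook. As a minimal sketch (the scenario names and actions below are invented purely for illustration), a pre-decision playbook can be nothing more than a lookup table consulted under pressure:

```python
# A hypothetical pre-decision playbook: each failure scenario you can
# imagine is mapped, ahead of time, to the action you have already
# decided to take. At 2 a.m. you look up the scenario instead of
# inventing a response under pressure.
PLAYBOOK = {
    "ddos-traffic-surge": "apply pre-built filter at the edge routers",
    "routing-feedback-loop": "shut down the pre-selected parallel link",
    "fabric-overload": "power off the pre-agreed low-priority servers",
}

def decide(scenario: str) -> str:
    """Return the preplanned action, or flag the gap in the playbook."""
    return PLAYBOOK.get(scenario, "no plan on file: escalate and document")
```

An unrecognized scenario is itself useful information: it tells you the playbook needs to grow before the next outage, not during it.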
It should be easy to act once you have observed the situation, oriented yourself to what is happening, and considered the preplanned decisions you made before the 2 a.m. call that the network is down—but it is often harder to act than it should be. Why?
First, it is often hard to believe “this is actually happening.” This is a common problem in self-defense situations; when you first encounter a problem, you do not want to adjust to the new situation. Instead, you would rather just ignore the problem and “move on with life.” This is Scrooge, in A Christmas Carol, saying to Marley, “there is more gravy than grave about you.” In the real world, however, this can be a very costly way to react.
Second, a storm of doubts will naturally accompany the actual moment of decision. Did you really observe this particular attack? What if you are wrong—will the consequences be worse than the attack itself?
The answer to both of these problems lies in the OODA loop itself. If you have staked out observation points, if you have oriented yourself against the background information you have, if you are following premade decisions made in more rational times, then acting is the right next step to take.
Hone your skills, know your network, know your monitoring points, know what you are looking at, know your plan, and do it. Make your plan, and then trust your plan.
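The observe, orient, decide, act cycle described in this section can be sketched as a simple pipeline. This is purely an illustrative skeleton; every function here is an invented placeholder standing in for real telemetry collection, mental models, playbooks, and change procedures:

```python
# An illustrative skeleton of the OODA loop applied to network
# operations. Each stage is a placeholder for a real process.
def observe(telemetry):
    # Collect raw measurements from known monitoring points.
    return [sample for sample in telemetry if sample is not None]

def orient(observations, baseline):
    # Compare observations against the mental model of "normal."
    return [o for o in observations if o not in baseline]

def decide(anomalies, playbook):
    # Prefer decisions made before the outage, not during it.
    return [playbook.get(a, "escalate") for a in anomalies]

def act(actions):
    # Execute the plan; trusting the plan is the hard part.
    return [f"executed: {a}" for a in actions]

def ooda_cycle(telemetry, baseline, playbook):
    return act(decide(orient(observe(telemetry), baseline), playbook))
```

The value of the structure is the ordering: action sits at the end of a chain of observation, orientation, and premade decisions, rather than being a reflex.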
Security is not simple; it is a broad field with a lot of potholes you can step into, often without realizing you have, in fact, just stepped into one. The castle walls of firewalls and demilitarized zones (DMZs) are in the distant past. Cannons have long since been invented, and the castle walls of the older world of network engineering are just testaments to a time long gone. Threat actors are everywhere, so defense must be everywhere, as well.
This chapter has not considered every possible defense, including such popular topics as microsegmentation and whitelisting, but it has described a set of helpful mental tools for understanding the world of network security.
Bacher, Martin, Robert Raszuk, Susan Hares, Danny R. McPherson, and Christoph Loibl. “Dissemination of Flow Specification Rules.” Internet-Draft. Internet Engineering Task Force, February 2017. https://datatracker.ietf.org/doc/html/draft-hr-idr-rfc5575bis-03.
CNN and Mary Kay Mallonee. “Hackers Publish 20,000 FBI Employees’ Contact Information.” CNN. Accessed April 23, 2017. http://www.cnn.com/2016/02/08/politics/hackers-fbi-employee-info/index.html.
Czyz, J., M. Luckie, M. Allman, and M. Bailey. “Don’t Forget to Lock the Back Door! A Characterization of IPv6 Network Security Policy.” In Network and Distributed Systems Security (NDSS), 2016. https://www.caida.org/publications/papers/2016/dont_forget_lock/.
“Data Breach Affects 80,000 UC Berkeley Faculty, Students and Alumni.” Text.Article. FoxNews.com, February 28, 2016. http://www.foxnews.com/tech/2016/02/28/data-breach-affects-80000-uc-berkeley-faculty-students-and-alumni.html.
Dobbins, Roland, Robert Moskowitz, Nik Teague, Liang Xia, Kaname Nishizuka, Stefan Fouant, and Daniel Migault. “Use Cases for DDoS Open Threat Signaling.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-ietf-dots-use-cases-04.
“EANS-Adhoc: FACC AG / UPDATE: FACC AG—Cyber-Fraud,” January 20, 2016. http://www.facc.com/en/content/view/full/3958.
Ferguson, Paul. Network Ingress Filtering: Defeating Denial of Service Attacks Which Employ IP Source Address Spoofing. Request for Comments 2827. RFC Editor, 2000. doi:10.17487/rfc2827.
戈雷尔、迈克. “盐湖县数据泄露暴露了 14,200 人的信息。” 盐湖论坛报。访问日期:2017 年 4 月 23 日。http: //www.sltrib.com/home/3705923-155/data-breach-exposed-info-of-14200。
Gorrell, Mike. “Salt Lake County Data Breach Exposed Info of 14,200 People.” Salt Lake Tribune. Accessed April 23, 2017. http://www.sltrib.com/home/3705923-155/data-breach-exposed-info-of-14200.
Herley, Cormac. “Unfalsifiability of Security Claims.” Proceedings of the National Academy of Sciences of the United States of America 113, no. 23 (June 7, 2016): 6415–20. doi:10.1073/pnas.1517797113.
“How Does Micro-Segmentation Help Security? Explanation.” SDxCentral, March 8, 2016. https://www.sdxcentral.com/sdn/network-virtualization/definitions/how-does-micro-segmentation-help-security-explanation/.
Khandelwal, Swati. “World’s Largest 1 Tbps DDoS Attack Launched from 152,000 Hacked Smart Devices.” Hacker News. Accessed April 25, 2017. http://thehackernews.com/2016/09/ddos-attack-iot.html.
Krupp, Johannes, Michael Backes, and Christian Rossow. “Identifying the Scan and Attack Infrastructures Behind Amplification DDoS Attacks.” In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 1426–37. CCS ’16. New York, NY, USA: ACM, 2016. doi:10.1145/2976749.2978293.
利里、朱迪. “国税局数据泄露不断增加。” IdentityForce®,2016 年 2 月 29 日。https: //www.identityforce.com/blog/irs-data-breach-more-taxpayers-affected。
Leary, Judy. “IRS Data Breach Grows.” IdentityForce®, February 29, 2016. https://www.identityforce.com/blog/irs-data-breach-more-taxpayers-affected.
———. “UCF Data Breach.” IdentityForce, February 8, 2016. https://www.identityforce.com/blog/ucf-data-breach-affects-63000.
———. “Verizon Enterprise Data Breach.” IdentityForce, March 25, 2016. https://www.identityforce.com/blog/verizon-enterprise-data-breach.
Marsh, Jennifer. “How to Detect and Analyze DDoS Attacks Using Log Analysis.” Loggly, March 2, 2016. https://www.loggly.com/blog/how-to-detect-and-analyze-ddos-attacks-using-log-analysis/.
麦金尼、马特. “数据泄露泄露了 3,000 多名 TCC 员工的信息。” 弗吉尼亚飞行员。访问日期:2017 年 4 月 23 日。http ://pilotonline.com/news/local/crime/data-breach-exposes-information-on-more-than-tccemployees/article_6ab72a2f-52a0-533e-8060-a2d245c7f151.html。
McKinney, Matt. “Data Breach Exposes Information on More than 3,000 TCC Employees.” Virginian-Pilot. Accessed April 23, 2017. http://pilotonline.com/news/local/crime/data-breach-exposes-information-on-more-than-tccemployees/article_6ab72a2f-52a0-533e-8060-a2d245c7f151.html.
Mortensen, Andrew, Flemming Andreasen, Tirumaleswar Reddy, Christopher Gray, Rich Compton, and Nik Teague. “Distributed-Denial-of-Service Open Threat Signaling (DOTS) Architecture.” Internet-Draft. Internet Engineering Task Force, October 2016. https://datatracker.ietf.org/doc/html/draft-ietf-dots-architecture-01.
Mortensen, Andrew, Robert Moskowitz, and Tirumaleswar Reddy. “Distributed Denial of Service (DDoS) Open Threat Signaling Requirements.” Internet-Draft. Internet Engineering Task Force, March 2017. https://datatracker.ietf.org/doc/html/draft-ietf-dots-requirements-04.
Muncaster, Phil. “Every Voter in Philippines Exposed in Mega Hack.” Infosecurity Magazine, April 8, 2016. https://www.infosecurity-magazine.com/news/every-voter-in-philippines-exposed/.
“1 Billion Yahoo Accounts Compromised in Data Breach | IdentityForce.” Accessed April 23, 2017. https://www.identityforce.com/blog/one-billion-yahoo-accounts-compromised-new-data-breach.
“Premier Healthcare Faces Possible Data Breach That Could Affect 200,000 Patients.” Healthcare IT News, March 9, 2016. http://www.healthcareitnews.com/news/premier-healthcare-faces-possible-data-breach-could-affect-200000-patients.
Richter, Andy, and Jeremy Wood. Practical Deployment of Cisco Identity Services Engine (ISE): Real-World Examples of AAA Deployments. 1st edition. Waltham, MA: Syngress, 2015.
Rigney, Carl. RADIUS Accounting. Request for Comments 2866. RFC Editor, 2000. doi:10.17487/rfc2866.
Rubens, Allan, Carl Rigney, Steve Willens, and William A. Simpson. Remote Authentication Dial In User Service (RADIUS). Request for Comments 2865. RFC Editor, 2000. doi:10.17487/rfc2865.
Santuka, Vivek, Premdeep Banga, and Brandon James Carroll. AAA Identity Management Security. 1st edition. Indianapolis, IN: Cisco Press, 2010.
“Securing Apache, Part 8: DoS & DDoS Attacks.” Open Source for You, April 1, 2011. http://opensourceforu.com/2011/04/securing-apache-part-8-dos-ddos-attacks/.
Siciliano, Robert. “Yahoo Data Breach: Almost 500 Million Affected.” IdentityForce, September 22, 2016. https://www.identityforce.com/blog/yahoo-data-breach-almost-500-million-affected.
Simeonovski, Milivoj, Giancarlo Pellegrino, Christian Rossow, and Michael Backes. “Who Controls the Internet?: Analyzing Global Threats Using Property Graph Traversals.” In Proceedings of the 26th International Conference on World Wide Web, 647–56. WWW ’17. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2017. doi:10.1145/3038912.3052587.
“Understanding and Mitigating NTP-Based DDoS Attacks.” Cloudflare Blog, January 9, 2014. http://blog.cloudflare.com/understanding-and-mitigating-ntp-based-ddos-attacks/.
Vacca, John R., ed. Network and System Security. 2nd edition. Waltham, MA: Syngress, 2014.
Vempati, Jagannadh, Mark Thompson, and Ram Dantu. “Feedback Control for Resiliency in Face of an Attack.” In Proceedings of the 12th Annual Conference on Cyber and Information Security Research, 17:1–17:7. CISRC ’17. New York, NY: ACM, 2017. doi:10.1145/3064814.3064815.
White, Russ, and Bora Akyol. Considerations in Validating the Path in BGP. Request for Comments 5123. RFC Editor, 2008. doi:10.17487/RFC5123.
1. Find a real-life data breach; identify the threat actor, the exploit, the vulnerability, the assets, and the risks.
2. While many engineers argue validating the path calculated through routing will improve the security of the traffic flowing over the path, there appear to be a number of problems with this approach, some of which are outlined in RFC5123. Explain what you think is the relationship between securing routing and securing traffic flowing through the network.
3. RADIUS is called out as a form of AAA in the text; find one other form of AAA and briefly describe it.
4. What is an intrusion detection system? Describe what it does.
5. What is a data exfiltration detection system? Describe what it does.
6. If unicast RPF would be effective at blocking attacks in the global Internet, why do so few providers offering transit connectivity in the global Internet deploy it?
1. “EANS-Adhoc: FACC AG / UPDATE: FACC AG - Cyber-Fraud.”
2. Leary, “UCF Data Breach.”
3. CNN and Mallonee, “Hackers Publish 20,000 FBI Employees’ Contact Information.”
4. Leary, “IRS Data Breach Grows.”
5. “Data Breach Affects 80,000 UC Berkeley Faculty, Students and Alumni.”
6. “Premier Healthcare Faces Possible Data Breach That Could Affect 200,000 Patients.”
7. Leary, “Verizon Enterprise Data Breach.”
8. Gorrell, “Salt Lake County Data Breach Exposed Info of 14,200 People.”
9. McKinney, “Data Breach Exposes Information on More than 3,000 TCC Employees.”
10. Muncaster, “Every Voter in Philippines Exposed in Mega Hack.”
11. Siciliano, “Yahoo Data Breach: Almost 500 Million Affected”; “1 Billion Yahoo Accounts Compromised in Data Breach | IdentityForce.”
12. Khandelwal, “World’s Largest 1 Tbps DDoS Attack Launched from 152,000 Hacked Smart Devices.”
After “yet another outage,” a large service provider called on its primary vendor with a demand: send these 16 (specific, by name) people to our office for one week so they can redesign our network and prevent this kind of outage from ever happening again. On hearing of the plan, one of the 16 engineers chosen for the “field trip” convinced the vendor’s management not to send such a large group of designers in to rebuild this network. The reason? When you get 16 designers in a room, what you will have is one person drawing on the whiteboard, and the other 15 erasing.
This story encapsulates one of the primary truths about network design: there is no one right way to design a network. Of course, there are better designs and worse designs, but the protocols and systems that networks are built out of are designed to be very forgiving in the face of imperfect conditions. If they were not, networks would be very fragile, failing completely with the first network device or link failure. So, given you can pretty much “slap stuff together” and “make it work,” what makes a design better or worse? This chapter will focus on answering this question.
Getting to the answer, however, will require working through a range of ideas and concepts, beginning with the problem space itself.
What problem are designers trying to solve? While this might seem like an obvious question, it is far too often forgotten in the design process. What generally happens is this:
1. Engineers are given a set of objectives to fulfill.
2. A rough sketch is made of two or three possible solutions.
3. These two or three solutions are considered and compared at a technical level.
4. A solution is chosen, and the equipment and configurations are determined and deployed.
This is not a bad process, necessarily; rather, this process tends to take some points that should be considered in every step and push them into a “corner.” Once the first stage is passed, it is never revisited. Just like security is often left to the last, the larger questions of network design are often pushed to the first part of the process, and forgotten about when the real “geek-level stuff” of selecting equipment and building configurations arrives. It is better to use the basic problem set as a backdrop in every stage of the design.
The network exists to solve business and real-world problems. When you are in the thick of choosing which forwarding engine is better, or which network software package has the greatest number of features, what is the fundamental question that network design needs to answer—the one that should be kept in mind at every stage of the design process?
What is the least expensive and most flexible way to provide transport for the applications required to solve a real-world problem?
This single question has many components, of course; it is worth looking at some of them in more detail.
What does the least expensive really mean? This is actually a difficult question with many different facets—and any answer given is probably the result of a good deal of crystal ball gazing and reliance on assumptions. Some expenses that engineers often remember to include are
• Hardware: The cost of the actual, physical devices, cabling, power, racks, and other gear required to physically build the network. This should include the cost of physical devices to connect the network through providers, for instance, spare equipment and tools.
• Software: The cost of licenses for the network operating system, routing stack, monitoring tools, and any ongoing maintenance costs.
• Services: The cost of having a technical assistance center to call, design services, offsite backup services, etc.
These kinds of costs, both in terms of capital expenses (CAPEX) and operational expenses (OPEX) are generally well understood, even if they are difficult to predict. A number of costs, however, are not generally included in any sort of planning, nor are they even well understood.
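To keep these cost categories visible through every stage of a design, they can be totaled over the expected life of the network in a few lines. This is only a rough sketch; every figure below is an invented placeholder, not real pricing:

```python
# A minimal total-cost-of-ownership sketch. CAPEX is paid once;
# OPEX recurs every year of the design's expected life.
# All figures are hypothetical placeholders for illustration only.
def total_cost(capex: dict, opex_per_year: dict, years: int) -> float:
    """Sum one-time capital costs plus recurring operational costs."""
    return sum(capex.values()) + years * sum(opex_per_year.values())

capex = {"hardware": 500_000, "cabling_and_racks": 50_000}
opex = {"software_licenses": 40_000, "support_contracts": 25_000}

# Five-year total: 550,000 in CAPEX plus 5 years of 65,000 in OPEX.
five_year_tco = total_cost(capex, opex, years=5)
```

Even a toy model like this makes one point clear: over a multiyear horizon, the recurring OPEX terms can rival or exceed the one-time CAPEX terms, which is why they cannot be pushed into a corner early in the design process.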
Many of these can be described as different forms of opportunity costs, and they often revolve around difficult operations and network modification driven by design complexity. Specifically, during the time it takes to bring the network up, modify the network, or troubleshoot and repair the network when there is a failure, the network is either not operational or is not fully supporting the business.
The key problem with opportunity costs is how difficult they are to measure. Content providers, for instance, tend to have very good systems for measuring the impact of a less than optimal network, because it is actually possible to translate slower page speed (for instance) to reduced engagement or reduced product purchases (the conversion rate). The key is to learn the specific business that the network you are building supports, and find some way to measure the impact of a network running in a less than optimal way. This kind of feedback comes back to the network engineering team as a set of unquantifiable complaints far too often. As a network engineer, you cannot wait for your customers to complain to find creative ways to measure how effective the network is at supporting the business. You must be proactive in this area, especially if you expect to convert the network from a cost center into a strategic asset for the company.
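As a hedged illustration of what such a measurement might look like, a content provider could translate added page latency into an estimated conversion loss. The latency sensitivity, traffic volume, and revenue per sale below are all assumptions invented for the example, not measured values:

```python
# Hypothetical opportunity-cost estimate: assume each additional
# 100 ms of page latency reduces the conversion rate by some fraction.
# The default sensitivity (1% per 100 ms) is an illustrative assumption.
def lost_revenue(extra_latency_ms: float, visitors: int,
                 baseline_conversion: float, revenue_per_sale: float,
                 drop_per_100ms: float = 0.01) -> float:
    """Estimate revenue lost to a slower network, in the same
    currency as revenue_per_sale."""
    conversion_drop = (extra_latency_ms / 100.0) * drop_per_100ms
    lost_sales = visitors * baseline_conversion * conversion_drop
    return lost_sales * revenue_per_sale

# 200 ms of extra latency across a million visitors:
loss = lost_revenue(200, 1_000_000, baseline_conversion=0.02,
                    revenue_per_sale=50.0)
```

The exact numbers matter far less than having *some* agreed-upon formula: once the business signs off on the sensitivity assumption, “the network feels slow” becomes a quantity rather than a complaint.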
What does the most flexible really mean? Networks tend to be sold as “multiple-use systems,” meaning they can support any application and any change in the business. In real life, however, this is rarely true. There are two enemies of flexibility in network design: ossification and the forklift.
Ossification is the process of hardening (petrification) that turns wood, bones, and even sacks of flour into stone. What was originally a pliable, changeable object becomes something difficult to modify, and prone to massive, degenerative, and often unexpected failure states. Networks ossify over time, as well, in several ways:
• New systems are layered on top of old, creating interaction surfaces, which are difficult to understand, difficult to troubleshoot, and difficult to replace.
• The network is successively tuned to meet the requirements of an ever-expanding array of applications and requirements. On the other hand, as older applications and requirements are removed, the corresponding tweaks and nerd knobs are not removed from the network. The resulting accretion of configuration often goes undocumented, and also creates a lot of state and unnecessary interaction surfaces in the network.
• Network architectures are often built at the intersection of vendor designs and products, “industry best practices,” and business needs. Vendors are (understandably) constantly trying to build future sales into current sales, and to convert the entire vertical to their product. At the same time, business leadership is often out reading reports and articles telling them about the latest trends and offerings, and then asking, “Why don’t we deploy these?” The result is a crucible of politics, trends, and practical considerations, often resulting in a less than optimal design, particularly in the area of flexibility.
Ossified systems are generally fragile; while they appear to be “rock solid” from the outside, they break unexpectedly, and far too easily, in the face of what might appear to be small amounts of change or pressure.
The forklift is a related but different problem in building flexible networks. The first half of the forklift problem is the general tendency to build networks out of appliances containing all the hardware and software in a single appliance; a router is purchased as a single appliance containing the routing protocols, the forwarding software and hardware, the power and cooling components, etc. This tie between the various parts of the system tends to result in strong vertical integration. While many systems from many different vendors will work together in a significant way, most vendor-driven appliance-based systems will enable special features and modes of operation only when their appliances are used in a single vendor environment.
The process of upgrading such systems usually involves a forklift—hence the industry shorthand, a forklift upgrade. To change the control plane architecture, the entire device must be replaced. If the control plane acts as an integrated whole in some way, with special features available only on a small range of devices, the entire network must be forklift upgraded. This is a challenging situation, to say the least. Networks designed for maximum flexibility are designed, instead, using some form of disaggregation. Figure 22-1 illustrates this.
Figure 22-1 illustrates a number of different options in terms of network ownership:
• If you buy everything needed to build the network, including using vendor-proprietary extensions to the control plane, then your network is vendor driven. In this case, if the vendor changes its product architecture, or the philosophy behind its control plane in order to support some new network architecture, then you must change your network design to follow the vendor.
• You can, instead, buy all your hardware and software from vendors, but use open-standards-based protocols to interconnect the equipment and build a network. Theoretically, this should make your network vendor independent (although it does not always work out this way in real life). When you are deploying this kind of network, it is important to keep up with new standards, and whether new features implemented by a single vendor are implemented in a way that allows routers from multiple vendors to successfully interoperate.
• In the disaggregated model, you might purchase a network operating system and hardware from one or more vendors, and rely on an open source (whether “tweaked” for your network or not) implementation of the routing stack. There are various other “modes” within this model, as well, such as relying on an open source network operating system as well as an open source routing stack, or perhaps relying on an open source network operating system and creating your own control plane.
• In the roll your own model, you are just buying hardware from a vendor—and perhaps you are even giving the vendor a set of standards to build hardware to.
In the real world, the model that most network operators use might be called old and moldy, which means you buy your equipment and leave it in place until it falls apart. This is the “normal” way of managing network growth and management over time for a lot of operators.
None of these models are as clear cut as Figure 22-1 implies; there are many different gradations between these various models. The important point for you, as a network engineer, designer, or architect, is to understand there are many other options than simply purchasing from a vendor. It is important to choose a model that provides the most value and flexibility for your company, rather than simply relying on what other companies are doing, or what has been done before.
Figure 22-2 illustrates another problem in the area of flexibility and fit to business in network design.
Figure 22-2 illustrates company and network size (or capacity) overlaid on top of one another. Networks, particularly appliance-driven networks, tend to be upgradable only in large chunks, and often through some form of forklift upgrade. The business, on the other hand, tends to grow in spurts, with occasional retrenchment. When the business size and network capability are mismatched—which is almost all the time in the real world—one of two situations is occurring:
• The network is undersized, which holds the business back from being able to support as many customers, or as much operational load; this creates an opportunity cost.
• The network is oversized, which means money is being spent unnecessarily— money invested in infrastructure that could have (most likely) been invested in some other way more profitably. This is another form of opportunity cost.
The more flexible a network design is, the more the network engineering staff will be able to make the business size and the network capacity track closely. This can reduce waste and lost opportunity.
Understanding the business side and translating business requirements into technical solutions are two different things. What tools does the network designer have to apply these business problems to network design? Modularity is the primary tool; just as nature often places a choke point between complex systems, network designers use choke points to separate complexity from complexity in network designs. Figure 22-3 illustrates this concept.
The idea behind modularity is to split up a single problem into multiple pieces, solving each piece separately, and then using a set of connectors to allow information to flow edge to edge in the overall system. This concept is used almost everywhere in network engineering; you might recognize at least the following:
• Layering protocols into functional and topological units; in Day’s RINA model, there are two protocols, each with two functions, layered on top of one another. There are a pair of such protocols across each link, between each pair of hosts, and between each pair of applications.
• Building layers of virtual topologies on top of a single physical topology. Complexity is controlled by carrying a subset of the topology and reachability information in each virtual topology, and restricting policy to the destinations attached to the virtual topology.
• Breaking up a network into flooding domains in a link state protocol, and patching the network back together with L1/L2 intermediate systems, or Area Border Routers. The flooding domain boundary represents the point at which topology information is summarized, and at which reachability information may potentially be aggregated as well.
Information hiding and modularity are closely related concepts. Any time information is being hidden, modules are being created. Just considering network design, modularizing a network can offer a number of solutions for business problems. Some specific examples:
• Choke points provide a point at which information can be hidden to control the scope and speed of state in the control plane. This, in turn, allows the network to scale.
• Choke points provide a point at which packets must move, providing a convenient place to implement packet forwarding policy, such as Quality of Service marking, security-focused filtering, etc.
• If modules can be sorted into “classes,” with each class having repeatable designs and configurations within the module, then building at least some parts of the network can be simplified into deploying the right module for a particular purpose. For instance, if modules 1 and 3 in Figure 22-3 are both campus networks of a similar size and scope, then they can be built in the same way, saving design and deployment effort. Repeatable modules also help make the network scale more closely with the business, as modules can be added or removed as complete units as needed.
• Modules can be sorted into “generations,” with newer designs replacing modules using older modules over time. This can help reduce the impact of the forklift upgrade problem, and hence allow the network to be more flexible in the face of business changes.
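The "classes" idea in the first bullet above can be sketched as a simple templating exercise. Every name, field, and value below is hypothetical, invented purely to illustrate how a repeatable module design separates the fixed class from per-site parameters:

```python
# Sketch: each module "class" is a repeatable design; deploying a module
# means stamping the class template with site-specific parameters.
# All names and values here are hypothetical illustrations.

CAMPUS_CLASS = {
    "uplinks": 2,                     # connections into the interconnect module
    "access_switch_model": "model-x", # standardized hardware per class
    "routing": "ospf",                # standardized control plane per class
}

def deploy_module(module_class, site_name, mgmt_subnet):
    """Instantiate a module from its repeatable class template."""
    instance = dict(module_class)   # copy the fixed, repeatable design
    instance["site"] = site_name    # only site-specific details vary
    instance["mgmt_subnet"] = mgmt_subnet
    return instance

# Modules 1 and 3 are both campus networks, so they share one design:
module1 = deploy_module(CAMPUS_CLASS, "campus-east", "10.1.0.0/24")
module3 = deploy_module(CAMPUS_CLASS, "campus-west", "10.3.0.0/24")

# The repeatable parts are identical; only the parameters differ.
assert module1["routing"] == module3["routing"]
assert module1["site"] != module3["site"]
```

Sorting modules into "generations" could then be modeled as versioned copies of the same template, with newer versions replacing older ones module by module.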
With all the positive points around modularization, are there negatives? In all areas of network design, it is important to remember:
If you have not found the tradeoff, you have not looked hard enough.
Although you have probably read this statement several times in this book, it applies to so many areas of network design and operation, it is well worth repeating. Several rules of thumb might be helpful.
The first is the size of the modules. If you make the modules too large, you are (probably) reducing the repeatability, not giving yourself enough places to hide information (which harms scaling), and not giving yourself enough places to insert policy into the network. If you make the modules too small, you reduce the effectiveness of the information hiding process. Finding the proper size is much like finding porridge that is just the right temperature—there is no clear-cut answer (or, perhaps a better answer, how many balloons fit in a bag?).
The second is the optimization tradeoff. Each time you hide information, you (more than likely) lose some form of optimization in the network. This is a fundamental rule, built into network and protocol design just because of the shape of reality.
The third is using network complexity as your guide in determining where to place module boundaries. In general, the best rule here is to separate complexity from complexity. For instance, if you have a large-scale spine and leaf data center fabric connected to a large partial mesh network core, it is probably best to put them into two different modules.
Finally, with this background, it is possible to return to the original question this chapter began by asking: what is a good network design? The first point should be obvious: it should fulfill the business requirements laid out previously. This means the network should provide the connectivity needed to run the business at the lowest practical cost, and it should be easily adaptable to the size and scope of the business. Flexibility inherently implies scale, as well.
A second point is this: good network designs degrade gracefully, rather than falling over a cliff. Figure 22-4 illustrates.
Any system that performs well under one set of conditions and fails to adapt to any change under less than optimal conditions is fragile; it has either ossified, or it was initially poorly designed. Another term sometimes applied to these sorts of systems is Robust Yet Fragile, which means these systems are apparently robust, and yet truly fragile under the right conditions.
A third point is this: good network designs allow operational staff to quickly find and repair failures. In other words, good network design interacts with the Mean Time to Repair (MTTR).
The rules of thumb on how to break up a network into modules are a good start, but they do not give you an entire picture. While they do give you a good set of ideas around what the goals of modularization should be, they do not provide a framework grounded in an intentional, systemic view of the network. Hierarchy is one design pattern that can be overlaid on top of modularization, or rather one pattern of modularization, and is often used to build large-scale networks. The essential points of modularization are as follows:
• Break up the functionality of the network into distinct pieces.
• Build modules around each piece, or purpose.
• Connect these modules through a set of special-purpose “interconnection” modules in a roughly hub-and-spoke topology.
Figure 22-5 is used to consider the basic three-layer hierarchical design model, which is often used in medium-scale networks.
Module 1 in Figure 22-5 is the network core. The primary—really the only— function assigned to the core module is to forward traffic as quickly as possible between distribution layer modules. There should be no control plane policy in this core; there should only be forwarding policy aimed at differentiated forwarding rules for Quality of Service and virtualization. This module will tend to be one of the more complex in terms of transports and equipment, and is also “unique” because there is just one core, so offsetting simplification in other areas allows the core to remain maintainable.
Modules 2 through 5 are considered the distribution layer. This layer is primarily responsible for carrying traffic within a region and all control plane policy. Each of the four modules in this layer should be as similar as possible; the physical configuration, at least, should be completely repeatable within a generation of each module across the entire network. Standardizing the hardware configuration across all of the modules in the distribution layer simplifies one part of the problem, allowing the complexity in this layer to reside in control plane policy, which is normally a complex problem.
Modules 6 through 13 reside in the access layer. Modules in the access layer are primarily responsible for providing connections into the network. Because of this, there will likely be many kinds of modules in the access layer. For instance, there may be one kind of module for supporting campus environments, another for supporting data center fabrics, and another for supporting Internet and extranet access. Traffic classification and security access should be focused in this layer of the network. Physical and logical configurations should be as repeatable as possible between modules of the same kind within the access layer.
An alternate form of hierarchy is the two-layer hierarchy, often used in smaller networks, illustrated in Figure 22-6.
The two-layer hierarchy appears to be a three-layer hierarchy with the access layer removed—but there are other subtle changes involved, as well. The aggregation-layer modules, 2 through 5, are primarily responsible for providing connectivity into the network, as well as packet filtering and classification. Security functions also tend to be focused in the aggregation layer. As with the three-layer hierarchy, these modules should be physically and logically repeatable where possible.
The core, as in the three-layer model, is focused on forwarding traffic at high speed. The missing piece seems to be control plane policy; in the core/aggregation model, control plane policy is implemented along the edge between the core and aggregation modules.
Figure 22-7 shows an example of a recursively layered hierarchy, often used in very large-scale networks.
Figure 22-7 illustrates what appears to be a simpler core/aggregation hierarchy. Looking more closely at module 2 in the aggregation layer, however, exposes an internal core/aggregation hierarchy, as well. The layer functions are the same; the core within module 2 (6) is focused on forwarding traffic quickly between the different aggregation modules (7 through 9) within module 2; the aggregation modules within module 2 are focused on providing connection, security, and classification; and control plane policy is imposed at the core/aggregation edge. This may appear to be a slightly modified version of a three-layer hierarchy, but there are several important differences:
• Regional policy is handled regionally, providing more flexibility in this area. Of course, this also means the modules are less likely to be logically repeatable, so there is more complexity in this kind of design.
• It is possible to build layers within layers more than once; module 6 may contain core and aggregation modules, as well. Because of this, the recursive layering architecture is a very powerful paradigm for building and understanding network topologies.
Another basic pattern in network design is the network topology. While it might seem there would be an infinite number of possible topologies, there are a few basic kinds that this infinite variety will fit into. Control planes tend to converge on any particular topology type based on the basic components—rings, meshes, and triangles. This section will consider a few basic topology types and some of their characteristics.
Ring topologies are among the simplest to design and understand. They are also the least expensive option, especially when long haul links are involved, so they tend to predominate in wide area networks.
Ring topologies have been scaled to large sizes; the additional cost to add a node is minimal. Generally, one new router (or switch), moving one circuit, and adding another new circuit is all that is needed. With careful planning, the addition of a new node into the ring can be accomplished without any real impact to overall network operations. Figure 22-8 depicts adding a node to a ring.
Adding new nodes to the ring increases the total hop count through the ring (from 4 to 5 in this case), and it does spread the available bandwidth across more devices. However, every device on the ring still has just two neighbors; this constant neighbor count is much of the secret behind the scaling properties of ring networks.
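A short sketch verifies both properties: the neighbor count stays fixed at two no matter how large the ring grows, while the worst-case hop count grows linearly with ring size:

```python
from collections import deque

def ring(n):
    """Adjacency list for an n-node ring: node i connects to i-1 and i+1."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def hops(adj, src, dst):
    """Shortest-path hop count via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # unreachable

# Every node always has exactly two neighbors, regardless of ring size...
for n in (4, 5, 100):
    assert all(len(nbrs) == 2 for nbrs in ring(n).values())

# ...but the worst-case hop count through the ring grows with n.
print(hops(ring(4), 0, 2))     # 2 hops across a 4-node ring
print(hops(ring(100), 0, 50))  # 50 hops across a 100-node ring
```

The constant neighbor count is what keeps control plane state per device flat as the ring grows; the growing hop count is the price paid for it.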
As ring size increases, it becomes difficult to manage Quality of Service and optimal traffic flows. Figure 22-9 illustrates.
In Figure 22-9, assume F has a voice over Internet Protocol (VoIP) stream connected to H, while G has some large file transfer (perhaps a complete backup). One of these streams requires very small amounts of bandwidth, low delay, and small amounts of jitter; the other requires large amounts of bandwidth but can tolerate a lot of delay and jitter. These two streams, however, share the [D,E] link. One option may be to force traffic along the “back side” of the ring to avoid the problem of having two kinds of traffic on the same link, but this would require tunneling one of the two streams using something like Multiprotocol Label Switching (MPLS). Another option may be to use a well-designed Quality of Service (QoS) mechanism to ensure the two streams can coexist. Either of these two solutions, however, adds complexity into the control plane and forwarding process. In terms of complexity, then, the ring topology requires just two neighbors per device, but it can require a lot more traffic engineering work to support all the requirements that applications place on the network.
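The shared-link problem can be demonstrated with a small sketch. The topology below is a simplification invented for illustration (F and G feeding through D toward H via E), not an exact reproduction of Figure 22-9:

```python
from collections import deque

def shortest_path(adj, src, dst):
    """One shortest path via BFS (ties broken by neighbor order)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def links_of(path):
    """Undirected links used by a path, as sorted node pairs."""
    return {tuple(sorted(pair)) for pair in zip(path, path[1:])}

# Hypothetical topology: both sources funnel through the [D,E] link.
adj = {
    "F": ["D"], "G": ["D"],
    "D": ["F", "G", "E"],
    "E": ["D", "H"],
    "H": ["E"],
}

voip = shortest_path(adj, "F", "H")     # the delay-sensitive stream
backup = shortest_path(adj, "G", "H")   # the bandwidth-hungry transfer
shared = links_of(voip) & links_of(backup)
print(sorted(shared))  # → [('D', 'E'), ('E', 'H')]
```

Any link in the shared set is a candidate for QoS policy or traffic engineering, since both flows compete for it.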
Rings can withstand a single failure anyplace in the ring; any two failures will cause the ring to split. However, a single failure can make the kinds of traffic engineering problems considered in the preceding section much more difficult to manage; a single failure essentially turns a ring into a bus, with just one path between each pair of connected devices. The “middle segment,” in this case, will be a bandwidth choke point in a very negative way.
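A quick sketch confirms this failure behavior on a six-node ring: any single link failure leaves the ring connected (degraded to a bus), while any pair of link failures splits it:

```python
from itertools import combinations

def connected(nodes, links):
    """Check connectivity with a simple reachability walk over undirected links."""
    if not nodes:
        return True
    seen = {next(iter(nodes))}
    frontier = list(seen)
    while frontier:
        u = frontier.pop()
        for a, b in links:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen == set(nodes)

n = 6
nodes = set(range(n))
ring_links = {(i, (i + 1) % n) for i in range(n)}

# Any single link failure: the ring degrades to a bus, still connected.
assert all(connected(nodes, ring_links - {l}) for l in ring_links)

# Any two link failures: the ring splits into disconnected segments.
assert all(not connected(nodes, ring_links - {l1, l2})
           for l1, l2 in combinations(ring_links, 2))
```

The surviving bus also has exactly one path between every pair of nodes, which is why a single failure concentrates all traffic onto the middle segments.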
Convergence in ring topologies lays the foundation for truly understanding the convergence of every other network topology ever designed or deployed. After you understand the principles of routed convergence in a ring topology, it’s simple to apply these same principles to quickly understand the convergence of any other topology you might encounter.
In other words, pay attention!
The crucial point to remember when considering how any control plane protocol—routed or switched—will operate on a particular topology is to think in terms of the “prime directive”: thou shalt not loop packets! This aversion to looping packets explains the convergence properties of ring topologies. Consider the rings in Figure 22-10.
While it is common to think of routing as using every link in the network, protocols build a spanning tree per destination. For any given destination, specific links are blocked out of the path to prevent a packet forwarded toward the destination from looping in the network. In the two rings shown in Figure 22-10, the links over which a routing protocol will not forward packets toward 2001:db8:3e8:100::/64 are marked.
In the case of the four-hop ring toward 100::/64:
• The link between B and C appears to be unidirectional toward B.
• The link between C and D appears to be unidirectional toward D.
Why are these links blocked in this way by the routing protocol? To follow the prime directive—thou shalt not loop!
• If a packet destined to 100::/64 is forwarded from D to C, the packet will loop back to D.
• If a packet destined to 100::/64 is forwarded from B to C, the packet will loop back to D.
In the case of the five-hop ring toward 100::/64, the link between G and H appears to be completely blocked. But why should this be? Suppose a packet destined to some host in 100::/64 is forwarded from G to H. This packet will be forwarded correctly to F, then to E, and finally to the destination itself.
But what if G is forwarding traffic to H for 100::/64, and H is also forwarding traffic to G for 100::/64? A permanent routing loop results. This means
• If the link between A and D fails, D has no way to forward traffic toward 100::/64 until the routing protocol converges.
• If the link between E and K fails, H has no way to forward traffic toward 100::/64 until the routing protocol converges.
Why is all this so important to understand? Because virtually every topology you can envision with any sort of redundancy is, ultimately, made up of rings (full mesh designs are considered by many to be an exception, and Clos fabrics are exceptions). To put it another way, virtually every network topology in the world can be broken into some set of interconnected rings, and each of these rings is going to converge according to a very basic set of rules:
• Every ring has, for each destination, a set of links not used to forward traffic.
• The failure of any link or node on a ring will cause traffic to either be dropped (distance vector) or looped (link state) until the routing protocol converges.
Given the speed at which a routing protocol can converge is directly related to the number of routers notified of a particular topology change, and hence the number of routers that must recalculate their best paths to any given destination, a third rule for control plane convergence on a ring is
• The larger the ring, the more slowly the routing protocol will converge (and thus stop throwing packets on the floor or resolve the resulting microloop).
These three rules apply to virtually every topology you encounter. Find the rings, and you have found the most basic element of network convergence.
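The first rule can be checked directly with a short sketch: compute shortest-path distances toward a destination attached to node 0 of a five-node ring, then collect the directed links no shortest path ever uses. The link "opposite" the destination comes back blocked in both directions, mirroring the fully blocked link in Figure 22-10. The routing protocol itself is not modeled here, only the per-destination tree it would build:

```python
from collections import deque

def shortest_dists(adj, dst):
    """BFS distances from every node to dst (links are symmetric)."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def unused_links(adj, dst):
    """Directed links that no shortest path toward dst ever uses."""
    dist = shortest_dists(adj, dst)
    return {(u, v) for u in adj for v in adj[u]
            if u != dst and dist[v] != dist[u] - 1}

# Five-node ring 0-1-2-3-4-0, destination attached at node 0
# (standing in for the lettered routers in the figure).
ring = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
print(sorted(unused_links(ring, 0)))
# → [(1, 2), (2, 3), (3, 2), (4, 3)]
```

The (2, 3)/(3, 2) link is blocked in both directions for this destination, exactly the per-destination spanning tree behavior described above; repeating the calculation for a different destination yields a different blocked link.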
While ring topologies are the cheapest to deploy and the simplest to scale, full mesh topologies tend to be the most expensive to deploy and the most difficult to scale. Full mesh topologies, however, are simpler to understand (in terms of convergence and scaling) than ring topologies. Figure 22-11 illustrates a full mesh topology.
There are ten paths from 2001:db8:3e8:101::/64 to 2001:db8:3e8:100::/64:
1. [E,A]
2. [E,C,A]
3. [E,D,A]
4. [E,B,A]
5. [E,D,B,A]
6. [E,D,C,A]
7. [E,C,B,A]
8. [E,C,D,A]
9. [E,B,C,A]
10. [E,B,D,A]
Traffic engineering techniques can be used to direct specific traffic onto any of these paths, allowing the network designer to design traffic flows for optimal performance. The number of paths through the network and the number of links required to build a complete mesh between a set of nodes are given by
n(n−1)/2
For the five-node network in Figure 22-11, this is
5(5−1)/2 = 10
This property of full mesh networks, however, also points to the weaknesses of this topology: scale and expense. These two weaknesses are closely related. Each new node added to the network means adding as many links as there are nodes already in the network. Adding a new node to the network in Figure 22-11 would mean adding five new links to bring the new node fully into the mesh.
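The n(n−1)/2 relationship, and the cost of growing the mesh by one node, can be checked in a few lines:

```python
def full_mesh_links(n):
    """Number of links needed to fully mesh n nodes: n(n-1)/2."""
    return n * (n - 1) // 2

# The five-node mesh in Figure 22-11 needs ten links.
assert full_mesh_links(5) == 10

# Adding a sixth node means adding one link to each of the five
# existing nodes, as the text describes.
assert full_mesh_links(6) - full_mesh_links(5) == 5

# Growth is quadratic, which is why full meshes scale poorly.
print([full_mesh_links(n) for n in (5, 10, 20, 40)])
# → [10, 45, 190, 780]
```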
Each new link added means not only a cable, but also a new port used, and new ports mean new line cards, and so on. Each new link also represents a new set of neighbors for the control plane to manage, increasing memory utilization and processing requirements. There are protocol-level techniques that can reduce the control plane overhead in full mesh topologies, if they are properly managed and deployed. Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS) both have the capability to build a mesh group, which treats the full mesh topology similar to a broadcast network; a small number of routers at the edge of the mesh are designated to flood topology information onto the mesh, while the remainder of the attached routers passively listen.
The one place where network engineers often encounter full mesh topologies is in virtual overlays, particularly in deployments where traffic engineering is a fundamental part of the reason for deploying the virtual overlay. Although the port and link costs are reduced (or eliminated) when building a full mesh of tunnels, the cost of managing and troubleshooting a full mesh remains.
A more commonly used mesh variant is a partial mesh topology. In a partial mesh, only some nodes are connected to one another, generally based on measured traffic patterns and perceived resilience requirements. Partial mesh topologies often reduce, in convergence and scaling terms, to a set of interacting ring topologies. Each “ring” within the partial mesh will scale and converge in the same way as the description of ring topologies already covered.
Hub-and-spoke topologies are built in just the way they sound; there are one or more hub routers connected to a much larger number of remote routers. Figure 22-12 illustrates two such topologies.
In Figure 22-12, there is just one path from 2001:db8:3e8:101::/64 to 2001:db8:3e8:100::/64, along [B,A]. If this path fails, connectivity fails between these two networks; this is called a single-homed network. To prevent a single point of failure from causing a complete outage for a particular site or application, many hub-and-spoke networks are designed with two hub routers, as shown in the network on the right side of Figure 22-12; this is called a dual-homed network. Special techniques are often used to scale such networks to support thousands of remote sites, such as
• Sending the remote site a minimal amount of routing information, such as just a default route.
• Reducing neighbor state toward these remote sites; for instance, Open Shortest Path First has the concept of a demand circuit, which allows the hub router to advertise its routing information once, blocking the periodic reflooding normally required to synchronize the database.
• Not calculating routes through the remote sites, as they should never be used to transit traffic. For instance, the Enhanced Interior Gateway Routing Protocol (EIGRP) has the ability to mark a remote site router as a stub, which blocks calculation of alternate paths through the remote site. OSPF has the ability to mark a route through a remote site with the maximum metric, which discourages routing through the remote site.
Without such special techniques, a dual-homed remote site will converge like a triangle; with them, it will converge more like a single-homed remote site. Because of the scaling, configuration, and management difficulties involved with managing large-scale hub-and-spoke networks, many such networks are now built using different options, such as
• A service-provider-provided service, where the hub and remote routers are actually managed by a service provider, and the customer receives the correct routing information and packets routed through the service provider network. This transfers the entire management load from the customer to the service provider.
• A Software-Defined Wide Area Network (SD-WAN) solution, which may be provided by a service provider, or installed and managed by the network operator. These services operate “over the top” of the standard Internet, using tunnels to build a virtual hub-and-spoke or full mesh network.
Network topologies can be described in terms of their properties, as well as the shape of the topology. Three important concepts are planar, nonplanar, and regular.
Planar topologies can be described using a single plane; this means links do not cross in a way that forces one link to “hop” over another link. In a nonplanar topology, at least two links will cross no matter how the topology is arranged. Figure 22-13 illustrates the difference between these two concepts.
Four networks are shown in Figure 22-13, marked A, B, C, and D. Network A is a planar topology; there are no points at which two links cross, which would require one link to “jump” over the other—or rather, would require a second plane to represent accurately. The topology in B is a nonplanar topology.
When examining networks to discover if they have a planar or nonplanar topology, try rearranging the links to see if they can be moved so that no two links will cross or overlap. For instance, network C in Figure 22-13 appears to be a nonplanar design, because of the two links crossing at the gray dashed circle; however, the same network is illustrated as D, but with one of the links moved so they no longer cross. The links in B cannot be rearranged to prevent any overlap in this way.
Regular topologies have one characteristic: they are made up of smaller, repeating topologies. Figure 22-14 illustrates a fabric of four-hop rings (or cubes), which is a regular topology.
In Figure 22-14, the four routers A, B, D, and E form a small four-router loop within the larger topology. Because any four-router loop you can pick out can replace any other four-router loop in the network, this is a regular topology; any set of four routers could be moved anyplace else in the network without changing the overall topology, and the network topology can be increased in size by simply replicating one piece of the topology and adding it back on.
Being able to pick out these kinds of topologies is helpful in understanding the way a particular network will converge, and what kinds of fast reroute and other options are available. It would take much more space than is available here in this chapter to draw these lessons out in detail, but being aware of these different design patterns is a good place to start.
Network design is often treated the same way network security is—left until the last moment, done as quickly as possible, with as little thought and fuss as possible. Real design, beginning with business requirements rather than speeds and feeds, or ports and racks, is often ignored in the tyranny of the immediate. “This project needs to be done now, forget the design stuff, just get it working.” This is the path to technical debt and—ultimately—crashed networks and failed businesses.
Proper network design needs to take a systemic view. This chapter, although a short overview, provides you with some of the basic mindsets and tools you need to start thinking through design problems and solutions. The next chapter will continue examining design topics by considering resilience and redundancy.
Oppenheimer, Priscilla. Top-Down Network Design. 3rd edition. Indianapolis, IN: Cisco Press, 2010.
White, Russ, and Denise Donohue. The Art of Network Architecture: Business-Driven Design. 1st edition. Indianapolis, IN: Cisco Press, 2014.
White, Russ, Alvaro Retana, and Don Slice. Optimal Routing Design. 1st edition. Indianapolis, IN: Cisco Press, 2005.
White, Russ, and Jeff Tantsura. Navigating Network Complexity: Next-Generation Routing with SDN, Service Virtualization, and Service Chaining. Indianapolis, IN: Addison-Wesley Professional, 2015.
1. Consider the objective of network design laid out in the text—to build the least expensive, most flexible way to provide transport. How does this goal set relate to the State/Optimization/Surface (SOS) model of managing network complexity?
2. Research an example of opportunity cost in the real world.
3. Explain why ossified systems appear to be well built and solid, but are often actually fragile.
4. Consider the disaggregated versus the vendor-independent model in terms of the State/Optimization/Surface tradeoff triad. Would either model increase state? Surfaces? In what way?
5. Give examples of when you might use a two-layer hierarchy versus a three-layer hierarchy.
Networks are designed to support applications, which in turn support specific business needs (or perhaps the application itself is the business). When the network is down, it obviously cannot support applications, but “down” is a rather ambiguous term. There are more kinds of “down” than “not forwarding packets at all.” The question this chapter asks is
What does network resilience mean?
There are a number of tools network engineers can use to create a resilient network. Fast Reroute, Exponential Backoff, and other fast convergence technologies can make a large difference in the speed at which the network converges. Graceful restart is another set of tools engineers can use, but it is not covered in this book (see the “Further Reading” section at the end of this chapter for pointers to more information on this topic). Redundancy, however, has always been one of the primary tools engineers of every kind have turned to in order to build in resilience.
The first part of this chapter, then, will describe resilience; the second part will consider the use of redundancy to create resilience in a network.
Slow performance and complete failure are the two most common application problems associated with network failures of any kind. How does the operation of the network relate to application problems of these kinds? Figure 23-1 will be used to consider the answers to this question.
In Figure 23-1, assume there is a long-standing flow between A and F flowing across the [B,D] link. If the [B,D] link fails, the application driving the flow could see several results:
• The flow could fail entirely. If some form of packet or route filtering is configured at any of the four routers illustrated, a [B,D] link failure may result in a complete loss of connectivity between A and F. In this case, traffic between A and F will stop flowing, and the application will fail.
• The end-to-end delay could change. Once the routing protocol converges on the only other available path, [B,C,E,D], there will be two more queues, two more switches, and two more deserialization/serialization delays added to the path. The application will see this as a sudden change in the amount of delay across the network.
• Jitter could increase. The additional queues, serialization, and deserialization could also increase jitter through the network. Some packets will variably be delayed while the routing protocol converges, particularly if there is a microloop formed during the convergence process. The total impact will likely look like a short burst of high jitter, followed by a general increase in jitter across the path.
• Packets could be dropped. Whatever packet is transmitted by B toward D just at the moment the link fails will likely be dropped.
• Duplicate packets could be delivered. It is possible, particularly if a microloop is formed during convergence, for one or two packets to be transmitted through the network twice. One simple example of this is if the retransmission timer at A is set very short before the failure, so packets are delayed longer than this timer during convergence, A could retransmit a packet while another copy of the same packet is already “in flight,” resulting in two copies of the same packet being received at F.
• Packets could be delivered out of order. Consider what happens if a micro-loop forms between B and C during convergence. It is possible a packet is being forwarded from C toward B just at the moment the microloop resolves. If this happens, a packet transmitted by A and received by B, before the packet looping between B and C, will be forwarded by B before the packet caught in the microloop is. The earlier packet will be delivered after the later packet.
It is virtually impossible to resolve these problems in a packet switched network. In fact, while many networking technologies have been developed over the years seeking to prevent these failures from ever occurring, most of these technologies end up adding so much complexity to the network that they have an overall negative impact on network performance. Tradeoffs are the hard-and-fast rule in engineering of all kinds; network engineering is no exception.
Resilience is easy enough to understand: the network does not fail nor produce the kinds of effects in applications discussed in the previous section. Resilience needs to be measured, as well as understood, however. This section will consider several ways in which resilience is measured. Three specific measures will be considered in this section:
• The Mean Time Between Failures (MTBF)
• The Mean Time to Repair (MTTR)
• Availability
Figure 23-2 is used to illustrate these concepts.
Three different measures of resilience are shown in Figure 23-2.
MTBF is the amount of time between failures in a system. Just divide the number of failures in any slice of time into the total amount of time, and you have the MTBF for the system (during this time slice). In Figure 23-2, the MTBF is the amount of time between the first and second failures. The longer the time slice (without changes in the system) you use, the more accurate the MTBF will be.
MTTR is the amount of time it takes to bring the system back up after it has failed. To find the MTTR, divide the total length of all outages by the total number of outages. To find the MTTR in Figure 23-2:
• Find the length of time between the first failure and the first repair.
• Find the length of time between the second failure and the second repair.
• Sum (or add) these two lengths of time.
• Divide this total by the number of outages—in this case, two.
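The two measures can be sketched together in a few lines; the function name and the outage-log representation here are illustrative choices, not something defined in the text:

```python
def mtbf_and_mttr(total_hours: float, outages: list[tuple[float, float]]) -> tuple[float, float]:
    """Compute MTBF and MTTR from a list of (start, end) outage times,
    all expressed in hours from the start of the observation window."""
    count = len(outages)
    mtbf = total_hours / count  # length of the time slice / number of failures
    mttr = sum(end - start for start, end in outages) / count  # total outage time / outages
    return mtbf, mttr

# Two outages, lasting 1 hour and 3 hours, in a 1,000-hour window:
# MTBF = 500 hours, MTTR = 2 hours.
```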
Availability is the total time the system was operational divided by the total time the system should have been operational (including outages). To get to availability from MTBF and MTTR, you can take the MTBF as a single operational period and divide it by the sum of the MTBF and the MTTR, like this:
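The formula itself appears as an image in the print edition; the standard relation it shows is:

```latex
\text{availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```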
Note
Most of the time, you’ll see availability calculated by adding the uptime to the downtime, and then dividing the uptime by the result. This arrives at the same number, however, because the total uptime added to the total downtime should (in the case of networks) be the total amount of time the network should have been operational. You might also see this expressed using the idea of Mean Time to Failure (MTTF), which is just the MTBF minus the MTTR—so adding the MTTF and the MTTR should result in the same number as the MTBF.
Availability is often expressed as a number of nines; for instance, one network may have four 9s of availability, while another may have five 9s of availability. This is shorthand for the fraction of time the network is available:
• Four 9s of availability means the network is available 99.99% of the time, or is not operational for about an hour a year.
• Five 9s of availability means the network is available 99.999% of the time, or is not operational for about 5.2 minutes each year.
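As a quick check on these numbers, the conversion from “nines” to yearly downtime can be sketched as follows (the helper name is an illustrative choice, not from the text):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def yearly_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year implied by `nines` nines of availability."""
    unavailability = 10.0 ** -nines
    return unavailability * MINUTES_PER_YEAR

# Four 9s allow roughly 53 minutes (about an hour) of downtime per year;
# five 9s allow about 5.3 minutes.
```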
The concept of availability needs to be considered in light of the meaning of availability for a particular network. Does the entire network need to be down to be considered unavailable? Or perhaps service needs to be unavailable to a particular set of applications or users? Or perhaps the network can be considered inoperable when some particular application (or set of applications, or set of users, etc.) suffers degraded performance. Answering these kinds of questions is very important in defining network availability.
There are other measures of network resilience for which no number can be produced, but are often just as important as the more commonly used measures. There is a good bit of humor in these measures, but the humor is backed by a good deal of serious experience.
The Mean Time Between Mistakes (MTBM) measures how long, on average, passes between mistakes that cause an application performance problem or a network failure of any size. The MTBM is related to the complexity of configurations in the network, including how the configurations of widely dispersed forwarding devices interact. A widely used rule of thumb is called the 2 a.m. rule: if you cannot explain the configuration at 2 a.m. to a technical support engineer whose primary language is not the same as yours, it might be worth reconsidering the configuration.
The Mean Time to Innocence (MTTI) is the amount of time required to prove the network is not at fault for a particular application problem. Proving this often requires a lot of “before” and “after” network measurements to show none of the changes in the network could cause the observed problem. It is important to pay close attention to the various ways an application can “see” a failure in the network considered in the preceding section.
Perhaps the primary tool network engineers use to create resilience in a network design is adding redundancy. One of the primary concepts to understand when considering adding redundancy to increase resilience is the single point of failure. Figure 23-3 illustrates this.
In Figure 23-3, from the perspective of A and G, the network has two redundant paths. But the redundancy is a deception; there is a single point of failure at D in the center of the network. If D itself fails, no traffic can flow between A and G. Figure 23-4 illustrates how adding a second (redundant) link would improve resilience.
In Figure 23-4, there are now two parallel paths through the network, one through [B,E], and a second through [C,F]. If both links fail on a fairly regular basis (they have similar availability), the odds are low that both links will fail at the same time. The chance of both links failing at the same time is small, but it is not zero. You can calculate the chance of both links failing at the same time, if you know the availability of each link, by calculating the combined availability of both links, using this formula:
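The formula, which appears as an image in the print edition, is the standard combined-availability calculation for components in parallel:

```latex
a_t = 1 - (1 - a_1)(1 - a_2)\cdots(1 - a_n)
```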
Substitute each parallel item in the network for a_1, a_2, and so on, and you can calculate the availability of the entire system, a_t. This will tell you how often, over the course of a year, both links are likely to be down.
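A minimal sketch of this calculation (the function name is mine, not from the text):

```python
def parallel_availability(*availabilities: float) -> float:
    """Combined availability of components in parallel: the system is
    unavailable only when every component is unavailable at once."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= 1.0 - a  # probability this component is down
    return 1.0 - unavailability

# Two links, each 99% available, combine to 99.99% availability.
```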
Note
This is called the availability calculation for parallel links or devices; it is also possible to calculate the total availability of devices or links connected in series, but this discussion is beyond the scope of this book.
There is a rule of thumb that works pretty well here without working through all of the math; it is called the halving rule. If you have two paths connected in parallel, each with a total downtime each year of 1 second, then the combined total downtime is likely to be half of the probable downtime of either link. The combined downtime of both links, then, should be around 500ms. Adding a third link will halve this number again to around 250ms. Because availability is essentially the inverse of downtime, adding redundancy quickly produces very little increase in availability. If you begin with a single link with five 9s of availability—so it would be unavailable, or down, about 5 minutes in each year—adding a second link in parallel will mean the pair of links is now unavailable about 2.5 minutes each year. Adding a third link reduces the unavailable time to 1.25 minutes per year, and a fourth reduces the unavailable time to about 37 seconds.
Each link added in parallel also increases the complexity of the network from the perspective of the control plane; more adjacencies must be formed, there are more paths to calculate through, there are more sets of databases to synchronize, etc. Each of these will slow down the convergence of the control plane at least some small amount. There is no clear point where decreasing downtime by increasing parallel links will be completely offset by increasing convergence time. Experience shows two links is often optimal, three links is good in exceptional situations, and four needs a careful look at the math to ensure the additional links are not decreasing overall availability, rather than increasing it. There are some exceptions, of course, in the case of tuned protocol deployments in fabrics (such as in a data center).
Because of diminishing returns, you simply cannot build a truly resilient system through redundancy alone. Real resilience must be built into the entire network, and the entire stack, with each part of the system playing its own role, from applications to control planes to redundant links.
Links are one crucial place to look for single points of failure—and one place where redundancy is often introduced to increase resilience. Adding parallel links does not always increase resilience, however; Figure 23-5 illustrates.
In Figure 23-5, the network operator has purchased links from two different providers to provide connectivity between A and H. From the outside, these two links appear completely unrelated; they terminate in different locations, are managed by two different providers, etc. However, someplace in the transit path, both providers have leased virtual circuits over a single fiber, and both links pass through this single optical link. A backhoe pulling this single fiber out of the ground—a backhoe fade—will take both circuits down.
Any time virtual circuits are laid over a physical infrastructure, there are Shared Risk Link Groups (SRLGs). Further, SRLGs are surprisingly difficult to discover and plan around, particularly in dynamically routed packet switched networks. There are systems for calculating SRLGs within a single operator’s network, which can be very useful for preventing SRLGs from causing a problem in data center fabrics or corporate networks, but they are outside the scope of this book.
Network engineers should be aware of SRLGs, and careful to plan around them where possible, particularly in relation to redundancy added to increase resilience.
Many times links are not the problem, but the rather complex network devices the links connect to. There are several solutions available to resolve problems with network devices, including
• Running multiple devices in parallel and allowing the control plane to route around failures. This essentially transfers any complexity from the device software into the network and applications running on top of the network—a valid design choice in many situations.
• Using Graceful Restart (see the “Further Reading” section at the end of the chapter for more information on graceful restart) to reduce the amount of time required to reconverge the control plane in the case of a device reboot or some other short-lived failure. In the case of graceful restart, each device maintains its forwarding state, resynchronizing and recalculating the set of loop-free paths through the network once the control plane processes have restarted.
• Using In-Service Software Upgrades (ISSU), or hitless restarts to restart the control plane without impacting packet forwarding at all.
Graceful Restart (GR) and ISSU rely on the device being able to forward traffic in hardware while the control plane is restarting; the hardware must be able to hold a forwarding table, and forward packets, without the control plane feeding new routing information into the forwarding table. There is some amount of risk in forwarding without the control plane, as the network topology can change while the control plane is restarting, causing a loop until the control plane reconverges. This is another instance of a microloop occurring because of unsynchronized topology databases.
Each of these solutions has advantages and disadvantages—each one is applicable in some situations and not in others—but engineers should be aware of these tools and their application to the problem of building a resilient network.
While they are rarely used, dual plane and multiplanar cores are sometimes deployed to ensure the highest levels of availability. Figure 23-6 illustrates these two types of cores.
In Figure 23-6, each core is represented with a different kind of dashed line to make it easier to see both of the cores. In both of these core types, everything is different:
• Equipment from two different vendors, or at least two different hardware lines using two different protocol implementations, is used to prevent a single bug or kind of failure from impacting the entire network.
• Two different interior gateway protocols, such as Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS), are used, one for each core, so a problem in a single protocol cannot impact the entire network.
• Two different providers, one for each core, are contracted to supply the links between the sites.
By creating two completely separate cores, you can avoid the problems associated with monocultures, or a bug that allows a problem to propagate throughout the network.
The primary difference between these two core types is shunt links, which are represented as dash-dot in the multiplanar core illustration on the right in Figure 23-6, such as the curved link between C and D. These shunt links are set to a very high metric, so they are used to forward traffic only if there is no other path available. An exterior gateway protocol, such as the Border Gateway Protocol (BGP), is used to tie the entire network together; each site edge router, such as A, would have two routes to any given destination in the network. One would be learned through BGP over one core, and the second through BGP over the second core.
MTTR can be broken down into two pieces:
• The time it takes for the network to resume forwarding traffic between all reachable destinations
• The time it takes to restore the network to its original design and operation
The first definition relates to machine-level information overload; the less information there is in the control plane, the faster the network is going to converge. The second relates to operator information overload; the more consistent configurations are, and the easier it is to understand what the network should look like, the faster operators are going to be able to track down and find any network problems. The relationship between MTTR and modularization can be charted as shown in Figure 23-7.
Moving from a single flat failure domain into a more modularized design, the time it takes to find and repair problems in the network decreases rapidly, driving the MTTR down. However, there is a point at which additional modularity starts increasing MTTR, where breaking the network into smaller domains causes the network to become more complex. To understand this phenomenon, consider the case of a network where every network device, such as a router or switch, has become its own failure domain (think of a network configured completely with static routes and no dynamic routing protocol). It is easy to see there is no difference between this case and the case of a single large flat failure domain. How do you find the right point along the MTTR curve? The answer is always going to be, “it depends,” but it is important to develop some general rules.
First and foremost, the right size for any given failure domain is never going to be the entire network (unless the network is really and truly very small). Almost any size network can, and should, be broken into more than one failure domain.
Second, the right size for a given failure domain is always going to depend on advances in control plane protocols, advances in processing power, and other factors. There were long and hard arguments over the optimal size of an OSPF area within the network world for years. How many LSAs could a single router handle? How fast would SPF run across a database of a given size? After years of work optimizing the way OSPF runs, and increases in processing power in the average router, this argument has generally been overcome by events.
Over time, as technology improves, the optimal size for a single failure domain will increase. Over time, as networks increase in size, the optimal number of failure domains within a single network will tend to remain constant. These two trends tend to offset one another, so most networks end up with about the same number of failure domains throughout their life, even as they grow and expand to meet the ever-increasing demands of the business.
So how big is too big? Start with the basic rules: building modules around policy requirements and separating complexity from complexity. After you get the lines drawn around these two things, and you’ve added natural boundaries based on business units, geographic locations, and other factors, you have a solid starting point for determining where failure domain boundaries should go.
From this point, consider which services need to be more isolated than others, simply so they will have a better survivability rate, and look to measure the network’s performance to determine if there are any failure domains that are too large.
While redundancy is the “go-to tool” for engineers building resilience into a network, redundancy has as many negative aspects that must be managed as it does positive aspects. Resilience requires much more than redundant links and devices; it must include many other techniques that involve the entire network stack from the physical links, through the control plane, and into the application itself.
Resilience, as with security, must be built into the network, rather than bolted on at the very end.
Papadimitriou, Dimitri. “Inference of Shared Risk Link Groups.” Internet-Draft. Internet Engineering Task Force, November 2001. https://datatracker.ietf.org/doc/html/draft-many-inference-srlg-02.
Pillay-Esnault, Padma, and John Moy. Graceful OSPF Restart. Request for Comments 3623. RFC Editor, 2003. https://rfc-editor.org/rfc/rfc3623.txt.
Rekhter, Yakov, and Rahul Aggarwal. Graceful Restart Mechanism for BGP with MPLS. Request for Comments 4781. RFC Editor, 2007. https://rfc-editor.org/rfc/rfc4781.txt.
Rekhter, Yakov, John Scudder, Srihari S. Ramachandra, Enke Chen, and Rex Fernando. Graceful Restart Mechanism for BGP. Request for Comments 4724. RFC Editor, 2007. https://rfc-editor.org/rfc/rfc4724.txt.
Torrell, Wendy, and Victor Avelar. “Mean Time Between Failure: Explanation and Standards.” White Paper. APC. Accessed May 13, 2017. http://www.apc.com/salestools/VAVR-5WGTSB/VAVR-5WGTSB_R1_EN.pdf.
———. “Performing Effective MTBF Comparisons for Data Center Infrastructure.” White Paper. APC. Accessed May 13, 2017. http://www.apc.com/salestools/ASTE-5ZYQF2/ASTE-5ZYQF2_R1_EN.pdf.
White, Russ, and Denise Donohue. The Art of Network Architecture: Business-Driven Design. 1st edition. Indianapolis, IN: Cisco Press, 2014.
1. Find or create a chart showing how much time per year three, four, and five 9s of availability translates to. Do you think these are realistic numbers? Is there anything interesting in the amount of downtime allowed at each point, or in how much the amount of allowable downtime changes?
2. Increased delay is not listed among the effects of a link failure in Figure 23-1. If you changed the network so the original link was a local circuit, and the backup path traveled over a much longer distance, would delay be something to look for in the case of the described failure?
3. Find the calculation for links or devices connected in series. What is the difference between this calculation and the calculation for devices or links in parallel?
4. Will running multiple devices in parallel, and allowing the control plane to route around failures, eliminate the need for graceful restart or ISSU in all networks? Why or why not?
5. Consider a multiplanar core with shunt links in light of the State/Optimization/Surface (SOS) model. What are the tradeoffs when deciding whether or not to include shunt links?
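As a starting point for question 1, the arithmetic behind the “nines” of availability can be sketched in a few lines (a generic illustration, not tied to any particular tool):

```python
# Downtime allowed per year for a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(nines: int) -> float:
    """Allowable downtime, in minutes per year, for N nines of availability."""
    availability = 1 - 10 ** -nines  # e.g., 3 nines -> 0.999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: about {downtime_minutes(n):7.1f} minutes of downtime/year")
```

Note how each additional nine cuts the allowable downtime by a factor of ten: roughly 525 minutes per year at three nines, but barely 5 minutes per year at five nines.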
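For question 3, the standard availability calculations for independent components can be sketched as follows (the availability figures are illustrative, and real MTBF-based comparisons involve more than this):

```python
from functools import reduce

def availability_series(availabilities):
    """Series: every component must work, so the availabilities multiply."""
    return reduce(lambda a, b: a * b, availabilities)

def availability_parallel(availabilities):
    """Parallel: the system fails only if every component fails."""
    all_fail = reduce(lambda a, b: a * b, (1 - a for a in availabilities))
    return 1 - all_fail

links = [0.999, 0.999]  # two links, each 99.9% available
print(availability_series(links))    # series is worse than either component
print(availability_parallel(links))  # parallel is better than either component
```

Chaining components in series always reduces availability, while placing them in parallel increases it; this is the mathematical root of redundancy’s appeal.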
It’s 2 a.m., the network is down, and the CEO is on the phone asking when it is going to be back up. The overnight job crucial to the business opening in the morning has failed, and the company stands to lose millions of dollars if the network is not fixed in the next hour or so. Almost every network engineer has faced this problem at least once in his career, often involving intense bouts of shouting (and/or screaming) intermixed with panicked attempts to find the root cause and fix it.
And yet troubleshooting is a skill that is hardly ever taught. There are a number of computer science programs that do include classes in troubleshooting, but these tend to be mostly focused on tools, rather than technique, or focused on practical skill application. While this chapter cannot be a complete course in troubleshooting, it will provide a basic overview of troubleshooting, including the problem set and some tools you will find helpful in locating and fixing problems (more) quickly. The basic question this chapter will answer is
What is the most effective process for finding and fixing problems in a network?
Each of the following sections will address one part of the answer to this question.
Note
In many cases, the points made in this chapter will be exemplified through stories told in the first person; these are true stories of troubleshooting success and failure supplied to help you understand the point being made.
Troubleshooting tends to be an exercise in narrowing—starting from a broad and imprecise description of the problem, moving to a more focused description, and finally finding one or more things to change in the network to resolve the problem. As with design, however, it is often easy to narrow too quickly and then to hop around rather than remembering to refocus on the overall purpose of the system if your first attempt at solving the problem does not turn out to be “the” problem.
In the middle of a long, exhausting troubleshooting session, it is easy to think of the system as the network path the application runs over and the application itself. To use an example from electronics, rather than networking:
One of the pieces of equipment on the flightline was a wind speed indicator. This is a fancy name for a really simple device; there was a small “bird” attached to the top of a pole, with a tail guiding the bird into facing the wind, and at the nose of the bird an impeller attached to a Direct Current (DC) motor. The DC motor drove a simple DC voltmeter graduated with wind speeds, and the entire system was calibrated using a resistive bridge in the wind speed indicator box and another in the wind bird itself. The power from the impeller was passed to the voltmeter, several miles away, through a 12-gauge cable. These cables were particularly troublesome, as they were buried and had to be spliced using gel-coated connectors, with the splices buried in gel-filled casing. This was all before the advent of nitrogen-filled conduit to keep water out.
In one particular instance, a splice failed, requiring the cable to be dug up by hand, and the splice opened and repaired. A special team was called in to resplice the cable, but even with the new splice in place, the wind system could not be calibrated to work correctly. The cable team argued the cable had all the right voltage and resistance readings; we argued back that the equipment had been working correctly before the splice failed, and all tested on the bench okay, so the problem must still be in the splice. The argument lasted for days. From the view of the cable team, their “system” was working properly. From the perspective of the weather techs, the system was not, even though the testable components were. Who was right? It all came down to this: What does the “system” consist of, and what does “working properly” mean?
Eventually, by the way, the cable splice was fingered as the problem in a capacitive crosstalk test. The splice was redone, and the problem disappeared.
The purpose is ultimately what the system is supposed to do, not just what you can measure. It does not matter if the network, or some component of the network, appears to be working fine. What matters is whether or not the system is accomplishing its purpose.
Of course, this means you need to understand what the purpose of the system is. In the broadest view, this means what the system is supposed to accomplish from a business perspective. A network can be running just fine from the perspective of the engineers who built it, but if it is not solving a business problem the way the business problem needs to be solved, it is still broken.
On the other hand, it is important to remember business folks do not always understand precisely how the business and the network relate, or they may have unrealistic expectations of what the network is capable of, or what is possible. In these cases, resist the urge to ask, “How high?” when the business says, “Jump!” Rather, cultivate a conversation in which you, the network engineer, have the right to say, “No, this will add too much complexity,” or “The tradeoff here is too high.”
Moving from the business to the network itself, there is a different, but still large, context: the network components.
Saying “a network is made up of components” is like saying “a menagerie of hand-made glass animals is made up of…glass”—it is not very useful. More specifically, what are the components of a network? In the network world, there are
• Hardware devices that process and forward traffic, such as routers, switches, and stateful packet filters
• The environmentals, such as the power and cooling
• The cabling, interfaces, and other hardware
• The software running on these devices (the operating system)
• The software applications providing the information needed to forward packets; the control plane
• The specifications to which the network was designed and needs to operate in order to fulfill business requirements
• The requirements placed on the network by the applications the network is supporting
A broader, and simpler, set of terms might be: requirements + network software + protocols + equipment. Again, this might be a little obvious, but it is easy to forget the entire picture at 2 a.m. when the fires are burning hot, and you are trying to put them out.
How well can you know each of these four systems? Can you know them in fine detail, down to the last packet transmitted and the last bit in each packet? Can you know the flow of every packet through the network, and every piece of information any particular application pushes into a packet, or the complete set of ever-changing business requirements?
Obviously, the answer to these questions is no.
As these four systems within a network interact (remember interaction surfaces from the first chapter?), they create a larger system that suffers from a combinatorial explosion. Figure 24-1 illustrates.
There are far too many combinations, and far too many possible states, for any one person to know all of them. How can you reduce the amount of information so you can reasonably understand the state of an entire system, and hence be able to troubleshoot it? By building abstract models of the system’s components, the interaction points between those components, and, ultimately, of the system itself.
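The combinatorial explosion described above can be made concrete with a toy count; the subsystems and per-subsystem state counts below are invented purely for illustration:

```python
import math

# Hypothetical number of distinct states each interacting subsystem can take.
states = {
    "requirements": 10,
    "network software": 50,
    "protocols": 40,
    "equipment": 100,
}

# Subsystems viewed in isolation suggest the sum of states;
# interacting subsystems produce the product of states.
print("sum of states:    ", sum(states.values()))
print("product of states:", math.prod(states.values()))
```

Even with these small numbers, the interacting system has two million possible combined states; abstraction is the only practical way to reason about it.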
This is the first skill of effective troubleshooting: build a set of models describing the system.
All models will necessarily be incomplete; a model can represent only some aspects of an entire system or subsystem. Thus, models are a two-edged sword: they present a more readily understandable version of a system, but they also present a necessarily incomplete version of a system.
There is no single way, nor a single set of tools, you can use to build an effective model. This section will consider some common kinds of models—how models and what models—and discuss how to build them. The importance of accurate models and the ability to use models in the troubleshooting process effectively are also considered here.
Note
This has some interesting implications, of course. For instance, when a system is a “black box,” which means you are not supposed to know how the system works, your ability to troubleshoot the system itself is nonexistent, and your ability to troubleshoot any larger system of which the black box is a component is severely hampered.
How models are built using problem/solution sets. This entire book, in fact, is an exercise in building how models, using a three-step process:
1. Determine the problem that needs to be solved.
2. Investigate the range of possible solutions for the particular problem.
3. Understand how this particular implementation uses a particular solution to solve a particular problem.
A potential fourth step beyond these three is integrating many solutions together into a complete system, considering the interaction between the solutions (such as where one solution reduces or increases the effectiveness of some other solution, etc.).
How models, of course, do not answer only one sort of question; for instance, when considering how Border Gateway Protocol (BGP) peers are formed:
1. How does BGP manage flow control, error control/correction, and data marshaling between two BGP speakers?
2. How does BGP manage the peering state between two BGP speakers?
3. How does an operator configure BGP to properly form peering relationships between two BGP speakers?
Each of these is a separate kind of how question. Where many engineers go wrong in building a solid foundation for troubleshooting is knowing the answers to the second and third questions, while never spending time on the first. Engineering tends to end up being focused on the question how do I get this done? rather than how does this work?, specifically why does this work this way? The result is a “hunt and peck” sort of troubleshooting style—combining small snippets of past problems with lots of knowledge about how things should be configured to make them work. This is generally a very inefficient way to not only design networks and protocols, but also to troubleshoot them.
You need to be able to build how models of all three types. There are several useful ways to build up your stock of how models, including
• Reading protocol theory and specifications, so you understand how and why a protocol operates (what problems are being solved and how they are being solved)
• Examining the designs of protocols and networks, and how they have performed in the real world
• Learning basic algorithms and heuristics, along with the problems they are designed to solve
Essentially, building how models is more about theory than practice; this is why engineers often skip learning how models—which prevents them from growing their engineering skills over the long term. How models are sometimes best expressed in a graphical format, such as Unified Modeling Language (UML) sketches, or flow charts.
What models are different from how models in describing the state of a particular network or application, or a common pattern found across many networks or applications. These kinds of models generally answer questions such as
• What is the normal path of this traffic flow (the signal path) through the system (such as a network, application, etc.)?
• What is the normal set of top talkers on this network?
• What is the normal distribution of load between these two paths in the network?
• What is the normal startup process between two BGP speakers?
The only real way to learn what is to observe and summarize many times over. For instance, observing the top talkers across a large number of networks, or even in a single network across a number of years, will give you a good sense of where to look for top talkers, and a good sense of when the top talkers in this situation do not make sense.
Manipulation in the observation process is important here, as well; Figure 24-2 illustrates.
In Figure 24-2, some value (representing, perhaps, an event or a property of an object) is assigned a variable, X. The question causation answers is: does X somehow cause Y? To answer this question, a range of possible interventions, or rather actions that will potentially modify X, are considered. In order to show X causes Y, all other potential interventions, Z1 through Zn, are held constant, while one potential intervention, Zi, is manipulated. Manipulability is useful in building how models by helping you understand the relationship between the different parts of a system; if you manipulate the impact of X on Y by changing Zi, then you can better understand the relationship between X and Y.
For instance, assume you want to understand how a specific application uses the network in terms of traffic flow. One possible way to discover this information is to set up a test instance of the application that passes through a router on which you can manipulate the Quality of Service (QoS) settings. By manipulating the QoS settings while watching the traffic flow, you can get a better sense of how the application works; you are literally building a what model of the application’s operation.
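A toy simulation can illustrate this intervention style of model building. Everything here is invented: the queue names, the latency figures, and the assumption that throttling the queue an application actually uses will visibly increase its latency:

```python
import random

random.seed(42)

# The application's traffic secretly lives in one queue; in a real test this
# is what the experimenter is trying to discover.
APP_QUEUE = "AF21"

def observed_app_latency(throttled_queue):
    """Latency (ms) seen by the application while one queue is throttled and
    everything else (the other potential interventions) is held constant."""
    base = 10.0
    penalty = 50.0 if throttled_queue == APP_QUEUE else 0.0
    return base + penalty + random.uniform(-1, 1)  # small measurement noise

for queue in ("EF", "AF21", "AF11", "BE"):
    print(f"throttle {queue:>4}: app latency ~{observed_app_latency(queue):5.1f} ms")
```

The latency spike when AF21 is throttled, with every other knob held constant, is the manipulation that reveals which queue the application depends on.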
It is important to build such models under as close to real-world conditions as possible, rather than in a “lab environment.” A real-world example might be helpful here, told in the first person:
One time I was called out with another tech from my shop to work on the FPS-77 storm detection radar. There was some problem in the transmitter circuit; the transmitter just was not producing power. A resistor blew in the “right area” all the time, so we checked the resistor, and sure enough, it seemed like it was shorted. We ordered another resistor, shut things down, and went home for the morning (by the time we finished working on this, it was around 3 a.m.). The next day, the part came in and was installed by someone else. The resistor promptly showed a short again, and the radar system failed to come back up.
What went wrong? I checked what was simple to check and what was a common problem, and then walked away thinking I had found the problem. It took another day’s worth of troubleshooting to pin the problem down: a component in parallel with the original resistor, but not on the same board, or even in the same area of the schematics, had failed. This second component was an inductor—essentially just a piece of wire wound tightly around a ceramic core. Inductors only show resistance when alternating current passes through them; they will always show a short when direct current is passed through them. Because the resistor and the inductor were in parallel, and the ohmmeter (a device which measures resistance) only uses direct current to measure resistance, the entire circuit appeared to be shorted.
In reality, the inductor failed, but the ohmmeter, because it cannot produce alternating current at the right frequency and power level, simply could not “see” the correct failure, and I was too tired, and too convinced I had found the problem in the first try (because it was the “common” problem in this signal path), to check beyond the first discovery.
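The physics behind the missed failure is easy to verify numerically. An ideal inductor’s impedance magnitude is Z = 2πfL, which is zero at DC (f = 0) regardless of inductance, so a DC ohmmeter cannot tell a healthy inductor from a dead short. The component values here are illustrative, not taken from the actual radar:

```python
import math

def inductive_reactance(frequency_hz, inductance_h):
    """Impedance magnitude of an ideal inductor: Z = 2 * pi * f * L (ohms)."""
    return 2 * math.pi * frequency_hz * inductance_h

L = 0.1  # henries (illustrative value)
print(inductive_reactance(0, L))   # 0.0 ohms at DC: indistinguishable from a short
print(inductive_reactance(60, L))  # ~37.7 ohms at 60 Hz AC
```

The lesson generalizes to networks: a measurement tool can only “see” failures within the conditions it is able to generate, so a clean test result is only as trustworthy as the test’s coverage.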
There are several kinds of what models engineers should build, including
• A description of the normal state of each system in a production environment. This is often called the baseline, and should include traffic levels measured at different points in the network at different times of the day, different seasons, and during different kinds of regular events; the amount of time it takes for any particular process to run, such as the Shortest Path First (SPF) runtime in a network running a link state control plane; jitter and delay through the network on a per application basis, etc. These measures are important because you cannot know what “broken” looks like unless you know what “normal” looks like.
• A description of the normal configuration of each system in a production environment. Many networks will have a single source of truth that contains the proposed configuration for each device. Many automation systems are designed to ensure each device matches the proposed configuration contained in this single source of truth. These systems should also contain the intent behind each configuration, as a single intent can be expressed in many different ways.
• A description of the “normal” reaction of the network to different types of events.
• A description of the signal path of every application running on the network, including the origination of information from the application, the paths the traffic normally takes through the network, queuing, and other ways in which the traffic is processed.
• A description of the security boundaries in the network, including the boundaries of each security domain (logical or topological), why the security domain exists, and how the various security domains interact.
While some of these will necessarily be complete, in containing every available piece of information within a particular domain, it is also important to be able to summarize each one into a model. Knowing what to abstract is a skill that takes years to develop and can never be said to be perfected.
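The baseline described in the first bullet above can be reduced to a minimal sketch: summarize “normal,” then flag measurements that fall outside it. The sample data and the three-standard-deviation threshold are invented for illustration; production baselining is considerably more sophisticated:

```python
import statistics

def build_baseline(samples):
    """Summarize 'normal' as a mean and a standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value, baseline, n_stdev=3):
    """Flag values more than n_stdev standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > n_stdev * stdev

# Hypothetical link utilization samples (Mbps), taken at the same hour each day.
normal_traffic = [410, 425, 398, 440, 415, 430, 405]
baseline = build_baseline(normal_traffic)

print(is_anomalous(420, baseline))  # False: within normal variation
print(is_anomalous(900, baseline))  # True: worth investigating
```

Without the baseline, the 900 Mbps reading is just a number; with it, the reading is a signal that something has changed.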
In fact, choosing which information to abstract into a model is such a difficult problem that network engineers often live with poorly built models that simply do not represent the underlying system accurately—or simply do not provide the information needed to ground good troubleshooting or design. Chapter 3, “Modeling Network Transport,” discusses one such (controversial) example, the widely used Open Systems Interconnect (OSI) model. The OSI model is a good example of a useful model in a narrow range of contexts, but is often used far outside the domain in which it adds value. Figure 24-3 illustrates the OSI and the Recursive Internet Architecture (RINA) models for reference.
A simple question illustrates the differences between these two models of forwarding information through a network: what functionality does this model describe? The Open Systems Interconnect (OSI) model generally describes the kinds of information contained at each layer in the network, which is also useful in describing the kind of information carried between layers to carry information through the network. The RINA model, on the other hand, focuses on functionality: what problem is solved where in the network stack for this particular connection (whether hop by hop or end to end).
While the OSI model is often useful for coding a network stack, because it describes information and APIs, the RINA model is often more useful for understanding a network stack—knowing what is happening where and why. The RINA model, in other words, more closely aligns with the problem/solution mindset needed to troubleshoot a network problem.
Accuracy does not mean perfection, in the sense that every aspect of the system is represented, but rather fit to purpose. Different models may be required to understand different aspects of a particular system; one model may be useful for troubleshooting one sort of problem, and another model may be useful for troubleshooting another sort of problem (or another part of the same overall problem).
Having a lot of models in your head to describe various aspects of the system is not helpful, however, if you do not know how to apply these models to the problem at hand. If you just take all these models and add them to the store of knowledge you already have about the system, you can make troubleshooting harder, rather than easier. The key lies in knowing how to apply models to the troubleshooting process.
The first step in applying models to troubleshooting is to learn how to shift between the various models as you need to, as you move between needing to understand a broader view of the system and a more detailed view of any piece of the system. Figure 24-4 illustrates.
In Figure 24-4, the overall system is depicted as a network path. Of course, there would be a larger context, such as business requirements, an application, or a set of applications, in real-life situations, but using the network path as a larger context will work for this example. Assume you are troubleshooting a problem with a specific flow through the network; the overall model you would keep in your head is the entire network path. As you encounter individual pieces of information about the problem, however, you might realize the problem appears to be something in the data plane, which might include QoS. Further information might indicate the problem is the application’s reaction to jitter on the network, which should move you from the QoS model into a jitter model, which evokes a different model.
Each model, as you move down the tree, is going to contain more detail within the specific area, but it is going to exclude more information from other areas of the network. Each model should accurately represent the problem and system at hand; using a model that “does not fit” at any point in this process can lead you down the wrong path in the “model tree.”
Another problem engineers often face in troubleshooting is an unwillingness to mentally move back toward the top of the tree; in this case, it would be easy to focus on the jitter problem without considering how the convergence characteristics of the control plane might interact with jitter. Once an idea has been formed about what the problem is, it is important to start back at the top of the “model tree” and work back down toward the more specific models from the more abstract ones taking the new information into account. An example might be helpful here, told in the first person:
Two engineers were called into a major network failure at a rather important bank; for some reason, the routing protocol, the Enhanced Interior Gateway Routing Protocol (EIGRP), simply would not converge under some conditions. The Technical Assistance Center (TAC) had tried various troubleshooting techniques, reconfiguring EIGRP in various ways; finally the problem was escalated to the Global Escalation Team.
To begin, the escalation engineers started gathering all the information they could on the state of EIGRP during the outage; thus the primary model in use was around the operation of the EIGRP protocol and its convergence process. Over time, it became apparent the problem related to missed EIGRP packets; so the engineers began looking at the packet processing and the transport links between the routers. Thus the model in use for troubleshooting moved towards the transport and packet processing side. It appeared the packets were being transmitted, but not received, so the focus again shifted to packet processing on the impacted routers. As almost every router in the network appeared to be impacted, this was still a very wide scope, but it quickly dove into rather detailed considerations about how packets were received, queued, and forwarded on to the EIGRP process, and how information about what was going on in this processing could be gathered in a production network.
To discover this information, a server was set up to capture the input queue of several routers periodically. Each time there was a failure in an EIGRP neighbor state, the log file was pulled to see what, specifically, was in the queue at that moment that was causing the EIGRP packets to not be delivered. The results were puzzling, to say the least; the packets in the queue were always from the same Internet Protocol (IP) address. No one could identify this IP address, so the engineers began to suspect some form of denial of service attack. Ultimately, however, the offending server was found: it was a security server. The routers in the network had been configured to send a per-command authorization request to this server to ensure the user currently logged in to the router had permission to run the specific command. The packets in the input queue each time a command completed and printed its output were the reply from the authentication server allowing the command to be executed. Needless to say, this rabbit trail did not help solve the problem any faster.
Eventually, the command level authentication was turned off, and the problem was found. A new backup software package had been installed on every host in the network—the server version had been installed on every host, rather than the client version. The server version, in order to find clients, attempted to contact every host on the network using subnet broadcasts across the entire IP address range, at a very high packet rate. These subnet broadcasts were being consumed by the actual routers, clogging the local process input queue, and hence causing EIGRP packets to be dropped.
The problem here ultimately required shifting between a number of different mental models, each one covering a different part of the network’s operation, but it was important, when shifting between models, to refocus on the “larger context,” so the problem at hand would not be forgotten. For instance, progress on the actual problem was completely left aside while the “rogue IP address” was being chased down.
While being able to shift between models is important, shifting between models randomly is generally not the most efficient way to go about troubleshooting a problem (although it is a common enough technique in the real world). What you need, once you have models built up and you have developed the ability to shift between models, is some way to guide how you move up and down the “model tree” while troubleshooting. Essentially, you need to know three things:
1. What question do you need to ask?
2. How do you ask this question?
3. Once this question has been answered, where should you move next in the system to continue troubleshooting?
One method stands out as a guide across many years of experience across many fields of study: the half split method. The steps for the half split method are as follows:
1. Map out the path of a signal through the system.
2. Split the path (roughly) in half.
3. Test the signal at the halfway point to determine if it is correct or incorrect.
4. If the signal is incorrect, move toward the source.
5. If the signal is correct, move toward the destination.
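For readers who think procedurally, the five steps above amount to a binary search over the hops in a path. The sketch below is an illustration, not a tool; `check_signal` is a hypothetical probe that would, in practice, be a packet capture or a show command run at the chosen hop:

```python
def half_split(path, check_signal):
    """Return the first hop at which the signal goes bad.

    Assumes the signal is correct at the source (path[0]) and
    incorrect at the destination (path[-1])."""
    lo, hi = 0, len(path) - 1          # signal known good at lo, bad at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2           # split the path roughly in half
        if check_signal(path[mid]):
            lo = mid                   # correct: move toward the destination
        else:
            hi = mid                   # incorrect: move toward the source
    return path[hi]                    # first hop where the signal is bad

# Invented example: a fault injected at F on the path [A, B, D, F, H]
path = ["A", "B", "D", "F", "H"]
bad_from = {"F", "H"}                  # hops where the signal measures bad
print(half_split(path, lambda hop: hop not in bad_from))  # F
```

Each iteration halves the remaining search space, which is why the method converges so much faster than walking the path hop by hop.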
The connections between the half split method and the concept of having models should be fairly apparent:
• The path of a signal through a system can be described as a set of overlapping models, with “lower level,” more detailed models being components of the larger context, more abstract models.
• The path of the signal is going to rely on the overall operation of standard components; each of these components either intersects or underlies one of the system models you encounter when tracing out the path of the signal.
The half split method can be used to guide your troubleshooting process through the subsystems within the system, using models to abstract enough information so you can “contain” the pieces you are trying to test in one step through the process. The half split method also helps you form the questions you need to ask of the system to know whether it is operating properly, such as
• What is the actual state of the system, and the result on the signal, at this point?
• What should a “normal” state look like?
The half split method can also force you to take your time and think through each step of the process. It is easy, when troubleshooting, to simply jump to where you think the problem might be and then dive into that small piece of the problem. Using the half split method will force you to look at what you are seeing, compare it to what should be there, and return to the larger context (move back up the “model tree”) on a regular basis. These are all crucial to not getting lost in the weeds, troubleshooting something that is either not a problem or is a symptom instead of a problem.
When considering the concept of half splitting in a network, what is the signal? Essentially, it is anything you can look at to verify the state of the system. For instance:
• The state of a neighbor adjacency in a routing protocol
• The jitter or delay on a given set of packets within a flow
• The existence of a flow of packets at a particular point in the network
• The completeness of a flow of packets at a particular point in the network (primarily looking for dropped packets)
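As a rough illustration of the last signal in this list, the completeness of a flow can be tested by comparing per-flow packet counts captured at two points along the path; the flow keys and counts below are invented for the example:

```python
def dropped_between(counts_upstream, counts_downstream):
    """Map each flow to the number of packets lost between two capture
    points; flows with no loss are omitted from the result."""
    losses = {}
    for flow, sent in counts_upstream.items():
        seen = counts_downstream.get(flow, 0)
        if seen < sent:
            losses[flow] = sent - seen
    return losses

# Hypothetical flows keyed by (source, destination, port)
up = {("A", "H", 5060): 1000, ("A", "H", 443): 500}
down = {("A", "H", 5060): 990, ("A", "H", 443): 500}
print(dropped_between(up, down))  # {('A', 'H', 5060): 10}
```

A nonzero entry tells you the drop is occurring somewhere between the two capture points, which is exactly the kind of answer the half split method feeds on.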
To find the signal, figure out what you would expect to be true of any particular information flow at this point (whether local, between peers, or end to end, host to host, etc.), and then determine what you think it should look like at the point you intend to test in the network. You might only need to shift between models during troubleshooting, but you might also need to shift between signals. For instance, in the example given in Figure 24-4, you need to shift from examining the jitter in a flow of packets through a network to the way in which the control plane converges, which may involve the signal path of distributing information about changes in the network topology. Both getting stuck in a single signal path and hopping from signal to signal are errors to be avoided in troubleshooting. These can be avoided using the two methods outlined in the following sections—using manipulability and simplification—but sometimes experience is the only and best guide.
Manipulation—and manipulability—is a key tool for testing theories and discovering the difference between correlation and causation. For reference, Figure 24-2 is repeated as Figure 24-5 here.
In troubleshooting, the key point is to find some Z you can use to modify the output of X in a way that impacts Y. In other words, if Y is the measured signal, you want to find some way to modify X to either show X is the cause of the current state of Y, or it is not. Figure 24-6 is used to explain this concept through an example.
Returning to the jitter example, assume there is some application passing traffic between A and H showing poor performance. After some work with the application, you determine the problem is with the jitter along the path between the host and the server. Examining the logs for the various devices, you notice the problem appears to correlate with SPF running on E. Using the half split method, you first trace the signal path, or, in this case the path of the flow, and find packets between these two devices follow the path [A,B,D,F,H]. You divide the circuit in half and decide to examine the signal at D.
How can you measure the jitter at D? The most obvious solution is to capture the packet flow on a packet trace device (or software loaded onto a standard host), then cause (or wait for) whatever is triggering the event at E, to determine if there is jitter at the output of D when the event occurs. Assume you find the jitter is, in fact, present at the output of D; the next step is not to simply assume this is where the problem is. Rather, the next step is to move toward the source and determine if the jitter is also present at the input to D. In this case, it would be logical to examine the output at B during the event at E, to see if the jitter is also there. Skipping this half split step might seem like it would speed up your troubleshooting process—you know where the problem is, why not just move directly to finding out why this is happening? The reason is simple: by skipping the step of moving toward the source and looking for the symptoms, you are failing to isolate the problem to a single point in the network. It is all too easy to spend a lot of time trying to understand why the problem is happening at D, only to discover the problem is not at D at all; it is someplace earlier in the network.
Assuming the signal is correct at the output of B, the next step is to find some way to manipulate the conditions at D to cause the problem; this will verify the problem showing up at D and the event occurring (SPF at E) is not just correlation but is causally related in some way. The best way to do this is to examine any logs at E that can tell you why SPF is running at E and then replicate those conditions while measuring the signal at D.
Once the problem can be replicated, you can know, for certain, what the cause is and start thinking about how to solve the problem.
Note
Real life is messier than the example given here. In real life, there can be multiple interacting causes and no way to manipulate the network into causing the problem. Sometimes, then, correlation must be taken at face value, and you must guess, trying solutions until you find one that makes the problem go away, or just relying on your knowledge of the internal workings of the system to find a resolution without taking on the full process. The half split process, as described here, is an ideal case; you will likely need to modify it on a per case basis in the field. On the other hand, the closer you can come to the ideal, especially when starting out on simpler problems, the faster you will be able to develop the “troubleshooting sense” required to speed up the process. Further, when you are stumped, it is always best to stop relying on your “troubleshooting sense,” and go back to the basics of half splitting, finding the signal, testing the signal, and finding a way to manipulate the signal.
Returning to Figure 24-5 (and Figure 24-2), there are many different Z’s (and probably many different X’s). This raises an interesting question: how do you know which particular variable among the many available variables to concentrate on? Knowledge of the system, combined with a liberal dose of experience, will be your primary guides here, but there is one other thing you can do to make your life simpler: simplify the system.
For instance, in a network with a lot of parallel paths, you might make more headway in troubleshooting a problem if you begin by eliminating components until the problem goes away, or until the network is down to a bare minimum functioning set of links and devices. This might seem counterintuitive—why would you remove redundancy to troubleshoot, when the network is already having problems?—but it is sometimes the only way to narrow down where a problem is.
If the problem does, in fact, “go away” before you reach some minimal set, then you should suspect there is some form of positive feedback loop in the network causing the failure, there is some problem with the amount and/or speed of state being carried in the control plane, or you have removed the problem in the process of reducing complexity (for instance, a flapping link, or device failing to forward traffic). In this case, you can add complexity back into the network until the problem reappears, which gives you a good manipulation test scenario. If the problem does not go away during the simplification process, then you now have a simpler signal path to troubleshoot, which will help you focus the half split process into a more confined space.
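The elimination process described above resembles what software testers call delta debugging: remove one component at a time, re-test, and keep only what the failure actually depends on. A naive sketch, with `problem_present` standing in as a hypothetical reproduction test, might look like this:

```python
def simplify(components, problem_present):
    """Greedily drop components the failure does not depend on, keeping
    the smallest set that still reproduces the problem."""
    essential = list(components)
    for comp in components:
        trial = [c for c in essential if c != comp]
        if problem_present(trial):  # problem still reproduces without comp
            essential = trial       # so comp is not needed to reproduce it
    return essential

# Invented example: the failure depends only on hostA.
parts = ["link1", "link2", "hostA", "hostB"]
print(simplify(parts, lambda cfg: "hostA" in cfg))  # ['hostA']
```

A sketch this simple cannot untangle multiple interacting causes, but it captures the idea: every component you can remove without losing the failure narrows the space the half split process has to search.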
Once the problem has been identified, you should fix it. However, the concept of fixing it isn’t always so simple in the real world. There are normally two stages to fixing it:
1. Solving the immediate problem with a configuration change, hardware replacement, etc.—a temporary fix
2. Preventing the problem in the future through design or through replacement of equipment, etc.—a permanent fix
It is often very difficult to tell the difference between a temporary fix and a permanent one; a good rule of thumb might be
A temporary fix incurs technical debt; a permanent fix either reduces technical debt or leaves it constant.
Technical debt is very hard to explain, but essentially it means doing something that will either cause fixing a problem in the future to be more complex or will result in a similar failure mode happening in the future. Perhaps an example will be the most useful way to explain these concepts.
Assume you are in a situation where a number of different virtual networks suddenly stop carrying business-critical traffic at the same time. Investigating the problem, you find a broadcast storm is causing the problem; a particular network interface card (NIC) is pushing random broadcast packets onto the physical network in a way that prevents traffic from being carried across any of the virtual topologies (an example of fate sharing).
Unplugging the system with the faulty NIC is the obvious first solution: is this a temporary fix or a permanent one? The host is there for a reason, so this must be a temporary fix, correct? Yes…and no. Shutting down this host provides an immediate fix, but to determine if this should be the temporary or permanent fix, you need to determine what the host is used for, and whether or not it is still needed. Leaving a host attached to a network if it is no longer needed, even once it is repaired, simply increases technical debt. Some other problem with this host in the future is going to cause a problem again, a problem that could have been avoided by simply removing the host entirely from the network.
Assume the host is required, replacing the NIC becomes the permanent fix, correct? Again, not necessarily. The host may be older and should simply be replaced entirely. Replacing the NIC in an older host may, again, simply increase technical debt, as the host may fail in some other way that causes a network failure at some point in the future.
Assume the host does need to be replaced. In that case, replacing the host should be the permanent fix, correct? Again, not necessarily. It could be time to reconsider the design of the network. If a single failed NIC should not be able to cause a systemwide failure, it may be worth considering a permanent fix that includes redesigning the network to reduce fate sharing or to reduce the scope of the failure domains.
The concepts of temporary and permanent fix are, then, flexible. Look to the business and the business drivers to think through where to stop when fixing a problem; don’t assume replacing the hardware is the final fix, nor that every problem requires a complete network redesign.
The half split method, grounded in accurate models of the system, is an effective method to use when troubleshooting large-scale problems in any system. It is not perfect, of course; the real world is far too messy for a single process to be “perfect” at solving all problems, but long experience has shown the half split method to be the best general guide to finding problems quickly. To reiterate:
• Build accurate models, particularly the business, the applications, the protocols, and the equipment. This is probably where most failures to effectively troubleshoot problems occur, and the step that takes the longest to complete. In fact, it is probably a truism to say that no one ever completes this step, as there is always more to learn about every system, and more accurate ways to model any given system.
• Have a problem/solution mindset. This is probably the second most common failure point in the troubleshooting process.
• Half split, measure, and move.
Some final points to consider:
• Never assume the problem is a result of a configuration or design change; always remember equipment failures, new traffic patterns, and other situations can (and often do) cause failures. Some management systems focus on change control to the point of excluding other failure modes from view, which can discourage effective troubleshooting, and increase technical debt over time.
• Do not take shortcuts. Do not start with what can be easily tested. Do not assume you have found the problem on the first test. Always try to find a way to both prove, and attempt to disprove, your theory.
• If something does not look right, it probably is not.
• Many of the concepts used in troubleshooting can be applied to testing, as well—validation, etc.—before placing things into the network.
Troubleshooting is an art grounded in technique, knowledge, and experience. Do not become frustrated if it proves difficult to learn this art; it often takes long hours of work with those who have more experience and have a better understanding of the system—and of what questions to ask—to have a strong set of troubleshooting skills. On the other hand, once you learn the art of troubleshooting, you will not likely forget it—and you will be able to apply it to many different areas of technology, not just network engineering.
Day, J. Patterns in Network Architecture: A Return to Fundamentals. Pearson Education, 2007. http://books.google.com/books?id=k9sFgIM-z6UC.
Fowler, Martin. UML Distilled: A Brief Guide to the Standard Object Modeling Language. 3rd edition. Boston, MA: Addison-Wesley Professional, 2003.
Huang, Peng, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems.” In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, 150–55. HotOS ’17. New York, NY, USA: ACM, 2017. doi:10.1145/3102980.3103005.
Lieberman, Norman. Troubleshooting Process Operations. 4th edition. Tulsa, OK: PennWell Corp., 2009.
Mostia, William L. Jr. Troubleshooting: A Technician’s Guide. 2nd edition. International Society of Automation, 2016.
1. Consider the Observe, Orient, Decide, Act (OODA) loop as described in the context of network security. How could the OODA loop be applied to troubleshooting?
2. Research the concept of a gray failure (look at the “Further Reading” section). How should gray failures change your troubleshooting process? What would you look for in troubleshooting gray failures?
3. Explain the difference between how and what models for network troubleshooting.
4. You are troubleshooting a problem where a small percentage of packets are dropped when being forwarded through a network. What does the percentage of packets dropped indicate about the tools required and the amount of information you will need to manage in order to troubleshoot this problem?
5. Describe different kinds of signals you might find in a network that can be used to trace out the operation of a particular system or application.
Up to this point, this book has focused on problems and solutions. Part IV is a bit different, in that it primarily focuses on some of the newer trends in network engineering:
• What is the virtualization of functions, what does this accomplish in terms of business requirements and usage of networks, and how does virtualizing functions interact with network design and performance?
• What is the Internet of Things, and what impacts might this concept have on the design and future of networks?
• Cloud is moving from the new and exciting to the normal and operational; what is cloud computing, and how are clouds built?
• Networks are becoming so large that it is becoming difficult for administrators and engineers to actually manage each piece of equipment individually in near real time. How and where do automation and development operations play a role in solving these problems?
These chapters are simply overviews of each of these areas; there is not enough space in a book, even of this size, to cover each area in any sort of detail. It is important, then, to pay attention to the “Further Reading” sections at the end of each chapter to find more material to learn about each topic.
When reading these chapters, you should focus on understanding and analyzing these technologies and ideas in the framework presented throughout this book. Ultimately, there is nothing really new here in terms of problems solved or solutions offered at a technology level. The constraints of the physical world, no matter how deeply buried in logical abstractions, will always impose reality checks on any solution set or design that must be deployed in the real world. Ultimately, you must look for the tradeoffs in design, security, privacy, cost, and fitness to the purpose of the network—complexity cannot be avoided; it can only be moved from one place to another in the network.
If you are reading this book after one of the trends in this section has “come and gone,” you should still read these chapters for their ability to make you think about larger scale, hard-to-solve problems. The intent of this book is to be timeless, in that it will still be a useful learning guide and reference 20 years from now (when you are reading this, not when it is being written). While not every component of the future can be found in the past—there are always surprises in technology and ideas—the fundamental building blocks can always be found in the past.
The chapters in this part of the book will help you understand how the many pieces considered up to this point can be put together in different ways to make something new. The chapters in this section include:
• Chapter 25: Disaggregation, Hyperconvergence, and the Changing Network, which considers the application of disaggregation to building networks, and data center fabrics
• Chapter 26: The Case for Network Automation, which considers network automation and Development Operations
• Chapter 27: Virtualized Network Functions, which considers Network Function Virtualization, Service Chaining, and scale out service design
• Chapter 28: Cloud Computing Concepts and Challenges, which considers the business drivers, tradeoffs, and challenges in moving processing to public cloud services
• Chapter 29: The Internet of Things, which considers the widespread deployment of sensors and other “things” attached to the Internet, and the challenges and possible solutions resulting from this movement
• Chapter 30: Looking Forward, which considers the future of network engineering, including some further thoughts on network automation, blockchains, and named data networking
The network engineering world has, since the very beginning, been appliance-based; you buy a router, switch, or some other piece of networking gear, you rack it, cable it, power it on, and configure it to fulfill the functions you need. This is far different than the rest of Information Technology (IT), which has always had many more diverse models of software and hardware. This chapter will begin with a look at two specific movements within the broader IT world and then relate these movements to network engineering.
In the distant past, computers were all built the same way. There was a case, a motherboard, memory, a hard drive, a keyboard, and a monitor. When companies started building networks, they began placing sets of servers into server rooms, including specially built furniture designed to hold 15–20 servers, and a Keyboard Video Mouse (KVM) switch so a single set of input and output devices could be used to manage all of the servers at once. The amount of space involved in such installations, along with the power and cooling problems, quickly led to the use of specially designed rack-mounted systems.
Each system, even in a rack-mounted case, was a single, standalone server of some type. One server might have file sharing, directory, and email services running (such as a Novell Netware, Banyan Vines, Lantastic, or IBM OS/2 server). Another server might have a database running on it, such as Oracle. As more resources were needed, the server would have additional memory installed, a bigger processor, more drive space, etc. This is called scaling up.
Over time, the processing and storage requirements simply became too large to build a single server able to handle the load, so applications were redesigned to run across multiple servers connected to the same segment. This is called scaling out.
Eventually, of course, through the work of Intel, VMware, and others, the applications, or processes, were disconnected from the physical compute resources— processor, storage, and memory—and placed into virtual machines (VMs), or later, containers. This virtualization process, however, had a side effect.
Once compute resources are virtualized, why should they be located on the same physical server? For instance, the hard drive does not need to be in the physical server, so long as it can be accessed as a virtual resource over the network connection. Thus, the compute resources themselves could be moved anyplace on the network, so long as they would be accessible, within specific performance requirements, to the applications that needed them.
The original physical format of these compute resources is called converged; all of the resources are converged in a single device. Only applications running on a physical processor can access resources such as disk, memory, and network interfaces, connected to the processor. Virtualizing access to these compute resources led to dis-aggregation. In a disaggregated system, the compute resources can live anywhere as long as they are accessible over the network. This brings the scale-out model to a new level. Rather than scaling out by crossing servers, you can scale out by actually pulling resources from various systems connected to the network as needed.
There is another side effect of the move to virtualize interfaces in this way. The virtual interface that applications use to connect to and use these resources essentially becomes a standardized Application Programming Interface (API), which means there is no reason to buy one brand of hardware over another, so long as the hardware meets the required performance metrics.
When you can buy hardware for its performance versus price profile, and use it with any other hardware (or software) you happen to have, the result is that the brand of equipment is deemphasized. This leads to the idea of a white box—buying hardware because of its components rather than the brand. Of course, white box is a somewhat loaded term; it somehow implies a couple of people sitting in a garage soldering boards together from whatever components they can come up with. This new “white box movement” might better be called disaggregation, as there is a wide range of hardware available, from basic features and functionality to fully supported branded devices.
The disaggregation movement, however, has a specific downside. Moving storage off the local system bus, connected directly to the processor, forces the processor and the applications running on the processor to access stored data through the network. The side effect is slower access to data. Although some databases are designed to allow the correct data set for specific operations to reside entirely in memory on a single node, carrying data to and from a disk over the network can still introduce serious limitations in a design.
The most obvious way to solve this problem is to move the data back onto the local system; however, then you lose the ability to build a set of compute resources dynamically. The solution to this problem is hyperconvergence. Here, the storage, memory, and network resources are still connected to individual processors, but they are virtualized in a way that allows all the processors attached to the network to access them. With good planning, storage, memory, and other resources can be allocated nearby, so network traffic is kept to a minimum, while still allowing VMs to be built out of a diverse set of resources.
Figure 25-1 illustrates the concepts of a converged, disaggregated, and hyperconverged architecture.
In Figure 25-1:
• In the traditional illustration, in the upper-left corner, each processor is attached to storage, memory, and network access through a local bus; applications running on the processor have access to these resources.
• In the converged illustration, in the upper-right corner, a single processor is attached to storage, memory, and network access through a bus. Multiple virtual machines are created using processor features; each of these virtual machines runs applications that can access the resources attached to the local processor bus.
• In the disaggregated illustration, in the lower-left corner, the storage has been centralized onto a device reachable through the network. Virtual machines running on the various processors access local memory and network resources but connect to storage through the network, which is accessible through the local processor’s bus.
• In the hyperconverged illustration, in the lower-right corner, each virtual machine runs on a particular processor, accessing memory and network resources connected to a processor through the local bus. An agent runs on each processor, as well, which redirects the locally attached storage to a network-based interface, and presents a single storage pool based on these resources. The storage manager will often attempt to locate data as close as possible to the processor using it.
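The storage manager's locality preference described in the hyperconverged illustration can be sketched in a few lines. This is a hypothetical illustration only; the names (`Node`, `StoragePool`, `allocate`) are invented for the example and do not come from any particular product.

```python
# Hypothetical sketch of hyperconverged storage placement: each node
# contributes its local disks to a shared pool, and the storage manager
# prefers placing a VM's data on the node where that VM runs.

class Node:
    def __init__(self, name, capacity_gb):
        self.name = name
        self.capacity_gb = capacity_gb
        self.used_gb = 0

    def free_gb(self):
        return self.capacity_gb - self.used_gb

class StoragePool:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.placement = {}  # volume -> node name

    def allocate(self, volume, size_gb, vm_node):
        # Prefer the node hosting the VM (data locality); fall back to
        # the node with the most free space elsewhere in the pool.
        local = self.nodes[vm_node]
        target = local if local.free_gb() >= size_gb else max(
            self.nodes.values(), key=Node.free_gb)
        if target.free_gb() < size_gb:
            raise RuntimeError("pool exhausted")
        target.used_gb += size_gb
        self.placement[volume] = target.name
        return target.name

pool = StoragePool([Node("n1", 100), Node("n2", 100)])
print(pool.allocate("vm1-disk", 80, "n1"))  # placed locally on n1
print(pool.allocate("vm2-disk", 40, "n1"))  # n1 is full, spills to n2
```

Real hyperconverged managers also replicate each volume across nodes for resilience; this sketch shows only the locality decision, which is what keeps most storage traffic off the fabric.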
Note
The term processor can be confusing in the context of virtualization. Many hosts contain one to four processors, with each processor containing one to eight cores. A single VM or container may run on a single core, consume all the processors and cores in a single host, or any combination of the above. To simplify the explanation here, however, the term processor was selected to represent any potential set of cores and/or processors a VM or container may run on.
You might note these illustrations focus specifically on the location of storage. This is because storage not only tends to be the easiest resource to move around the network, but it is also often one resource where you can save a lot of expense through some form of centralization. For instance:
• While data can be compressed on multiple devices, it is often better to run specialized hardware able to compress and decompress data to and from storage on the fly. Such specialized hardware can not only run compression much faster, but it can be tuned to compress more deeply and use less energy in the compression process than a general-purpose processor.
• The same holds true for encryption; most modern processors can certainly handle encrypting data while it is being written and decrypting data while it is being read, but specialized processors are often so much more efficient, they are worth the investment if large amounts of storage are involved.
• Data deduplication can reduce the amount of storage used, also reducing costs. If, say, a company memo is sent with a 1MB attachment, and 1,000 people save it, the result will be 1GB of storage consumed. A data deduplication system can save one copy of the attachment, replacing each “copy” with a pointer to the single copy, saving 999 copies of the attachment. Data deduplication works for operating system files, applications, databases, and any other sort of information; it can dramatically decrease storage requirements in many cases.
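As a rough sketch of how deduplication avoids the 999 extra copies, the following stores each unique attachment once, keyed by a content hash, and keeps only per-user pointers. The `DedupStore` class is invented for illustration; production systems deduplicate at the block level rather than per file.

```python
# Minimal content-based deduplication: identical content hashes to the
# same digest, so it is physically stored only once.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}    # digest -> content (stored once)
        self.pointers = {}  # (user, name) -> digest

    def save(self, user, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blocks.setdefault(digest, content)  # store only if new
        self.pointers[(user, name)] = digest

    def bytes_stored(self):
        return sum(len(c) for c in self.blocks.values())

store = DedupStore()
memo = b"x" * 1_000_000  # a 1MB attachment
for user in range(1000):
    store.save(user, "memo.pdf", memo)

# 1,000 saves, but only one physical copy: about 1MB instead of 1GB
print(store.bytes_stored())  # 1000000
```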
In each of these solutions, applications are still limited to the physical memory and network resources attached to the local processor. In a composable system, even these resources can be shared among processors. Figure 25-2 illustrates one way to build a composable system.
In Figure 25-2, a processor bus has been extended so it has many different processors, network interfaces, memory banks, and storage devices that can be attached. A system manager composes sets of resources out of this large pool of resources for individual virtual machines to run on. Not all composable systems use an extended processor bus in this way; some attach each individual device to an internal Ethernet network, using the network to transport information from the processor to the external network interface, or a storage device. This sort of configuration allows a system to scale out to very large sizes, while continuing to treat each resource as a white box; it does not matter who makes each device, so long as they can all present a uniform set of APIs to the composable system manager.
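A composable system manager of the sort described above can be approximated as a pool of counted resources from which virtual servers are reserved. This is a simplified, hypothetical sketch; a real manager tracks device identity, bus topology, and placement, not just counts.

```python
# Hypothetical composable system manager: resources sit in vendor-neutral
# pools, and virtual servers are composed from whatever is free, as long
# as every device answers the same (assumed) allocation API.

class ResourcePool:
    def __init__(self, **pools):
        # e.g. cores=32, memory_gb=256, nics=8
        self.free = dict(pools)

    def compose(self, **req):
        # Reserve a set of resources for one virtual server, all or nothing.
        if any(self.free.get(k, 0) < v for k, v in req.items()):
            return None  # not enough of some resource
        for k, v in req.items():
            self.free[k] -= v
        return dict(req)  # the composed "virtual server"

pool = ResourcePool(cores=32, memory_gb=256, nics=8)
vm1 = pool.compose(cores=8, memory_gb=64, nics=2)
vm2 = pool.compose(cores=30, memory_gb=64, nics=2)  # fails: only 24 cores left
print(vm1, vm2)
```

The all-or-nothing check matters: partially reserving resources and then failing would strand capacity that no other virtual server could use.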
A second disaggregation movement happened as a result of virtualization at the server level—the disaggregation of applications. Most applications were designed to run on a single device, with full access to an entire range of local hardware resources, and to complete a task from start to finish. For instance, an application tracking customer orders might hold customer information, product information, current orders, past orders, inventory, etc., all in a single set of databases. Over time, such applications were broken into a database back end and a business logic front end, but these two pieces were still somewhat unified, identifiable applications.
With the rise of virtualization, it started to make more sense to break up an application into many pieces, with each piece running on a set of virtual servers. In this way, any piece of the application could be scaled up to meet demand or scaled back to use fewer resources when demand was low—another version of scale out, but in terms of applications.
Breaking an application into smaller pieces, each of which represents a single service within the larger application, and then interconnecting those applications eventually leads to microservices, a form of computing where each individual module of an application is broken out into a smaller app, each of which does one thing very well. The apps are connected over the network so the application actually runs on the entire network.
Not only do such systems tend to scale out well, but they can also manage change and failure in more graceful ways. If a single host or router in the data center network fails, it will likely represent just some small part of the processing the entire application does; such failures can more easily be dealt with than a single host that runs an entire processing system failing.
These three trends—the disaggregation of server hardware, hyperconvergence, and the trend toward virtualized services rather than applications in the traditional sense—have had a marked impact on the design parameters for data center networks. This section will consider two of these changes specifically: the rise of east/west traffic and the rise of jitter and delay sensitivity in the network.
In converged and virtualized converged systems, the network is primarily used for carrying traffic to and from hosts, whether the host is virtualized or not. The server is, in effect, a black box to the network; traffic of various sorts enters the device from the outside world, and traffic is transmitted from the device to the outside world. Traffic being carried to and from servers from outside the data center is called north/south traffic, as it is traveling between the top and bottom of the network diagram as “traditionally drawn.” Figure 25-3 illustrates.
In Figure 25-3, the entire server H appears to be one black box; moving traffic between storage, memory, processor, and the network interface is handled through the processor bus, which is, in effect, a small internal network. The primary traffic flows in this network will be from A to H and back again, which is along the north/south axis of the network.
Figure 25-4 illustrates what happens when the storage is centralized through disaggregation.
In Figure 25-4, any time the processor needs to copy information from storage into memory, the data must travel across the network. This data is called east/west traffic, as it is flowing from one device connected to the data center network to another. The disaggregation of applications into services, and potentially microservices, has the same effect as the disaggregation of hardware resources. Combining these two realities, a single request from a host, such as A, will represent a small amount of north/south traffic but will drive a lot of east/west traffic.
How much more? Most web and hyperscale network operators report about a 10-to-1 ratio—for each bit of north/south traffic, there will be about 10 bits of east/west traffic. It is not unusual for web scale networks to carry multiple terabits of data a day in response to several hundred gigabits of actual user requests.
The disaggregation of applications and compute resources has caused jitter and delay through the network to become a very big problem. Specifically:
• Once you separate the storage from the rest of the compute resources, the performance of the application and the performance of the network are intrinsically linked. If the network is congested, for instance, taking even some fraction of a second to transfer data from a storage device to a processor, the impact on the performance of the application can be devastating.
• Once you break up the application into services and move toward microservices, the performance of any one service will impact the performance of the entire application.
A convenient way to think about this is: The processor bus itself has been extended over the data center network. The application, as a whole, is now running on the network in the same way it once ran on a single host or within a single device. The network, as a whole, is now a system and must be treated as a system.
Any delay or jitter in the network can cascade through the system, causing the entire application to perform poorly. When your revenue depends on user engagement, and user engagement depends on the speed at which your application loads, any problem in the data center network shows up directly as a loss of revenue.
How can network architectures be adapted to meet the requirements of an application running on the network itself, treating the network as a system? To solve this problem, network engineers returned to some old ideas about the best way to build circuit switched networks, merging them with packet switching principles to create the packet switched fabric. This section will consider some aspects of fabric design.
How is a fabric a special case of a network? To begin, it is best to discard various marketing uses of the term fabric, such as
• Any network with an overlaid virtual topology, including a “core fabric” and a “campus fabric”
• Any high-performance network, with high performance meaning high bandwidth
• Any network with a lot of equal cost multipath (ECMP) availability
• Any network where the entire network is treated as a single “thing,” rather than as a set of separate components
These uses of the concept of a fabric almost always come down to marketing; engineers and managers have become comfortable with a fabric being some sort of special network and hence more desirable than a “plain old-fashioned network.” This is much like the marketing craze in the mid-1990s around calling a router a “layer 3 switch” because it performed a header rewrite in hardware. The last definition in the preceding list—any network treated as a single “thing”—is very clever, because it implies you cannot build a fabric out of individual components. Rather, in this definition, a fabric is something you must buy as a unit from a vendor as “one thing.”
Leaving aside these sorts of marketing definitions, what makes a network a fabric? There are three specific characteristics of a fabric:
• The regularity of the topology
• The way in which the topology scales in bandwidth and connectivity
• The specific performance goals the topology is designed to fulfill in terms of forwarding
Each of these deserves a closer look.
Topological regularity means the topology of the network is well defined and repeating. To say a topology is repeating is to say the topology consists of a large number of identical pieces repeated to create the scale required; Figure 25-5 illustrates.
The difference between the regular and irregular topologies should be apparent:
• If you “pick up” [A1,A2,B1,B2] as a unit, and move A1 to the same position as B2, the two pieces of the topology are identical. In fact, [A1,A2,B1,B2]; [B1,B2,C1,C2]; [A2,A3,B2,B3]; and [B2,B3,C2,C3] are identical “subtopologies” of the larger topology. Each of these subtopologies is interchangeable within the larger topology.
• The same is true of [D1,D2,E1,E2] in the second network topology illustrated; these four routers can be moved to any other position in the network without any modifications to the overall topology.
• [G1,G2,H1,H2], however, is unique within the third network, at the lower-left corner of the illustration. There is no other place in the network with the same topology. This is an irregular topology.
• While [L1,L2,M1,M2] has the same topology as [M2,M3,N2,N3], neither of these sets of four routers has the same topology as [L2,L3,M2,M3]. Again, this is an irregular topology.
Why is this an important point when deciding if a network is a fabric? First, because fabrics are generally designed to use completely replicable hardware, software, and configurations at the subtopology level. You can think of this as a form of micro-modularization, perhaps, with each piece of the network designed to be fully replicable in very small pieces primarily for ease of configuration and management, rather than for breaking up failure domains. In fabric designs at scale, the physical layout is separated from the logical layout of the network as much as possible.
The scaling characteristics of the network topology are the second marker of a fabric. Specifically, fabrics tend to scale out instead of scale up. These two concepts have already been discussed in relation to servers and applications. How do they apply to a network? Figure 25-6 illustrates.
The upper network in the illustration is configured as a fabric, while the lower one is configured in a hierarchical topology. The problem at hand is, how do you add enough bandwidth to connect a new pod of equipment? In the lower network, the hierarchical design, you can add a new aggregation router at the edge of the network, and connect the new equipment there. However, adding this new router and new equipment may also mean the bandwidth in the core of the network needs to be increased to support the additional load. Generally, this means adding more links or perhaps adding parallel links and either running ECMP or bonding the links in some way. In either case, this means larger ports or more ports, higher-speed links, etc. The older equipment must either be replaced or augmented to add capabilities.
In the upper network, adding a single new pod requires adding three new routers to the network and the links associated with the new routers. However, the total bandwidth of the network increases as the new connection point is added. Hence, the network scales by adding more equipment of the same kind, rather than by modifying the existing equipment. The difference between adding more modules and replacing or augmenting existing equipment is the key differential between scaling up and scaling out.
The general rule of thumb is this: fabrics scale out, rather than up; hierarchical designs scale up, rather than out. This is not always true, of course; fabrics do have a scaling limit based on the number of ports connected to each device, and other designs can be built so they have some measure of “scale out” before the hardware must be augmented or replaced, but the general rule holds in most cases.
Performance goals are the third differentiator between a network and a fabric. Networks typically have performance goals centering around Quality of Service (QoS) handling and uptime. Fabrics have similar, but sometimes slightly different, sorts of performance goals. For instance:
• Failure rates are often measured in terms of the pods and/or other components of the fabric, rather than the “entire network,” or a particular application. Most applications designed to run on hyper- or web-scale fabrics are designed to tolerate being moved between racks of servers, so a single rack, pod, or link failing can be countered by moving the application to a different rack or pod attached to the fabric.
• The movement of workloads to different places in the fabric places an often difficult-to-manage mobility requirement on fabrics. Mobility is not often a factor in other network topologies. Workload moves on a fabric must be dealt with very quickly; application users do not often wait for the network to converge around a failed rack or pod.
• Fabric design is often focused on the fabric’s oversubscription, which means the amount of bandwidth available in the network core compared to the amount of bandwidth available at the edge ports. For instance, if an edge switch or router (called a Top of Rack [ToR] or leaf) offers 320Gb of bandwidth down to servers, but has just 160Gb of fabric connections, it is described as being 2:1 oversubscribed. Another way to describe oversubscription is in terms of how much bandwidth is available from any port to any other port on the fabric. If every port on a ToR can send traffic at a full rate to some other set of ports attached to the fabric, then the fabric is said to have 1:1 (or no) oversubscription.
• Many network designs focus on reducing delay using traffic engineering and Quality of Service techniques. Fabrics, on the other hand, try to reduce jitter as well as delay, and mostly try to reduce end-to-end queueing, rather than implementing any sort of complex QoS. Many fabrics do, however, use some form of traffic engineering.
One of the most commonly used topologies to build fabrics is the spine and leaf, which is not really a single design, but rather a family of designs based on the same basic building block. Figure 25-7 illustrates a basic spine and leaf design.
The bottom and top stages are called either Top of Rack (ToR) or leaf nodes; these are where hosts and other devices are connected to the network. The remaining stages are generally called spines of some sort; there are two rules for spines in a standard spine and leaf:
• There are no connections between spine routers.
• No devices of any sort are connected to spine routers; all connectivity into the fabric is carried through a leaf node.
An alternate form of numbering is shown on the right side of the fabric. Fabrics can be drawn folded, but the stage count is given based on the total distance through the fabric; the fabric shown in the illustration, while drawn folded in three tiers, is a five-stage fabric. The number of stages can be confusing in some configurations of spine and leaf topologies.
Note
Some spine and leaf designs do have connections between spine routers; this can solve some problems when you are aggregating routes on the fabric, but it also can add a lot of complexity into the network design and control plane convergence.
In the standard configuration, as shown in Figure 25-7, adding stages does not really add more ports; instead you would scale out this kind of fabric. The scaling limit is the number of ports available on each device in the fabric, and the oversubscription rate is the difference between the amount of bandwidth offered by the ToR routers and the amount of bandwidth available from the ToR routers into the fabric.
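The scaling limit and oversubscription rate described here can be worked out directly from port counts. The following is a back-of-the-envelope sketch assuming identical switches in a simple two-tier (folded three-stage) leaf and spine layout; real designs must also account for spine port consumption, resilience, and link failures.

```python
# Back-of-the-envelope capacity and oversubscription for a simple
# leaf and spine fabric built from identical switches.

def fabric_numbers(ports_per_switch, server_ports_per_leaf, port_speed_gb):
    uplinks = ports_per_switch - server_ports_per_leaf
    # Each leaf uplink goes to a distinct spine, so there are as many
    # spines as uplinks, and each spine can attach one leaf per port.
    max_leaves = ports_per_switch
    max_servers = max_leaves * server_ports_per_leaf
    down_bw = server_ports_per_leaf * port_speed_gb  # toward servers
    up_bw = uplinks * port_speed_gb                  # into the fabric
    return max_servers, down_bw / up_bw  # capacity, oversubscription

# 32-port switches, 24 ports down to servers, 8 uplinks, 10Gb ports:
servers, oversub = fabric_numbers(32, 24, 10)
print(servers, oversub)  # 768 3.0  (768 servers, 3:1 oversubscribed)
```

Setting `server_ports_per_leaf` to half the switch's ports gives 1:1 (no) oversubscription, at the cost of halving the number of servers the fabric can attach.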
One of the key points about spine and leaf fabrics is they do not need a complex control plane to forward traffic correctly. While most hyperscale networks do use a complex control plane, it is normally used to compensate for cross links, to provide information for an overlay virtual network, or to provide for some form of traffic engineering.
Two other helpful concepts in the fabric world are the skinny tree and the fat tree. Skinny tree fabrics use the same link speed between every stage within the fabric (this does not include the ports provisioned for servers, however). Fat tree fabrics use a smaller number of higher-speed links in the center stage of the fabric, generally between the super spine and the spines, and lower-speed links between the spines and the ToR devices. In either case, the same oversubscription concepts apply; the primary difference between the two is in the optics: the amount of cabling and how the ports are configured on the various devices in the fabric.
Why would you ever need to deploy traffic engineering on a fabric designed with no oversubscription? Figure 25-9 illustrates.
In Figure 25-9, assume A has some large flow destined to C that consumes just about all of A’s local link into the fabric and is persistent; it will last for more than two or three seconds, potentially into the realm of days or months. These large, persistent flows are called elephant flows in the context of a data center fabric. Generally, elephant flows relate to large data transfers (such as those involved in moving a Hadoop job around on the fabric, or a database replication), and are not sensitive to jitter. Assume this flow is placed on the path [V1, W2, X3, Y2, Z1]. At some point during the duration of this elephant flow, B starts a short session with low bandwidth use requirements, in support of a delay- or jitter-sensitive application. Assume this flow is placed on the path [V2, W2, X3, Y2, Z3].
Both of these flows are going to share the [W2, X3] and [X3, Y2] links. Given the nature of the two flows, the smaller flow, sometimes called a mouse flow, will not meet its jitter requirements even if there is plenty of bandwidth available on other paths on the fabric. To resolve this, the elephant flow needs to be pinned to a single path, and the path taken out of consideration for use by other flows passing through the network. This is the primary use case for traffic engineering in a noncontending data center fabric.
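The pinning described here can be sketched as a toy path selector. Everything below is illustrative — the flow tuples, path names, and the hash-based selection are stand-ins for what a real fabric would do in hardware or in a centralized traffic-engineering controller:

```python
import hashlib

class PathSelector:
    """Toy model of ECMP-style path selection with elephant-flow pinning."""

    def __init__(self, paths):
        self.paths = list(paths)  # paths available to ordinary (mouse) flows
        self.pinned = {}          # flow -> path reserved for an elephant flow

    def pin_elephant(self, flow, path):
        # Pin the elephant flow to one path, and withdraw that path
        # from the candidates considered for other flows.
        self.pinned[flow] = path
        if path in self.paths:
            self.paths.remove(path)

    def select(self, flow):
        if flow in self.pinned:
            return self.pinned[flow]
        # Hash the flow identifier onto one of the remaining paths.
        digest = hashlib.md5(repr(flow).encode()).digest()
        return self.paths[digest[0] % len(self.paths)]

selector = PathSelector(["V1-W2-X3-Y2-Z1", "V2-W1-X2-Y1-Z3"])
selector.pin_elephant(("A", "C"), "V1-W2-X3-Y2-Z1")
# The mouse flow from B can now only land on the remaining path,
# so it never shares a link with the elephant flow:
assert selector.select(("B", "C")) == "V2-W1-X2-Y1-Z3"
```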
Many large (web- or hyper-) scale networks use a butterfly fabric, which is a variant of a Benes, and also a type of spine and leaf fabric. Figure 25-10 shows a small example.
In Figure 25-10, there are two fabrics, each of which might also be called a core, and a set of ToR devices. Each fabric is a full spine and leaf; each ToR connects to one point in each fabric. Depending on your perspective, this can be considered a five-stage fabric, ToR to ToR, or it can be considered a three-stage fabric with an additional set of access devices (though you would still never connect any devices or external access to the leaf nodes of the two fabrics in this network—all connectivity would be through one of the ToR routers or switches).
The primary advantage of such a design is the oversubscription rate and scaling can be adjusted within the limits of the fabric side ports of the ToR devices. To decrease the oversubscription rate, increase the number of cores. To increase the scale, increase the number of cores and ToR devices in parallel.
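The scaling arithmetic can be sketched as follows. The port counts, and the simplifying assumption that every link runs at the same speed and each ToR dedicates exactly one uplink to each core, are hypothetical:

```python
def butterfly_numbers(tor_fabric_ports, tor_server_ports, core_leaf_ports_total):
    """Rough scale and oversubscription figures for a butterfly design.

    Assumes uniform link speeds and one ToR uplink per core; all of the
    port counts used below are illustrative, not taken from the text.
    """
    cores = tor_fabric_ports          # one fabric-side ToR port per core
    tors = core_leaf_ports_total      # each core leaf port hosts one ToR uplink
    servers = tors * tor_server_ports
    oversub = tor_server_ports / tor_fabric_ports
    return {"cores": cores, "tors": tors, "servers": servers,
            "oversubscription": oversub}

# Hypothetical ToRs with 4 fabric ports and 40 server ports, and cores
# whose leaf stage exposes 128 ports each:
print(butterfly_numbers(4, 40, 128))
# Doubling the fabric-side ports (adding cores) halves the oversubscription:
assert butterfly_numbers(8, 40, 128)["oversubscription"] == 5.0
```

This mirrors the text: more cores decreases oversubscription, while growing cores and ToRs in parallel increases scale.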
Disaggregation has caused a revolution in the way compute resources are built and used. Can these same concepts be applied to the network?
Networks have, “since forever,” been built out of appliances. A device is purchased from a vendor, racked, cabled, powered on, and configured through some sort of semiproprietary interface. Each device has a fairly unique feature set; in fact, the feature set of the software and hardware combined is the primary selling point, because the wide range of features (and nerd knobs) allows a single piece of gear to be used in a wide variety of networks, under a wide variety of conditions. This ability makes the appliance “future proof,” in the sense that no matter what problem you throw at the appliance, it is likely to have some feature that can be enabled to “solve” the problem (for some value of “solve”).
The result is an engineering world that
• Chases features whether or not they are needed to solve a particular problem right now, leading to overengineering in many cases.
• Chases service and support, because the devices themselves are so complex, and the networks built from them tend to use a combination of features found nowhere else in the world; hence each network is the same and yet each network is completely unique.
• Splits the work of design and architecture between the vendor, who shapes architecture by building products for the widest possible audience, and the operator, who tries to use as many square pegs as possible, because this is what vendors offer, regardless of the shape of the problem.
Looking over the history of compute resources, this is precisely the problem set that disaggregation was designed to solve. Perhaps, then, disaggregation in the network can help solve these same problems. There is one more lesson from compute and applications to consider before looking at disaggregation in the network, however.
Disaggregation does not look the same in applications and compute resources; this is primarily due to the physical limitations of each kind of system, and where there are points at which efficiency can be improved. Given this experience, disaggregation will probably look different in the network, while still driving the same sorts of efficiency and operational gains.
The key points in disaggregation in applications and compute resources have been
• Decoupling hardware from software
• Commoditizing the hardware so it is usable across a wider range of functionality
• Specializing the software (such as services- and microservices-based application development)
• Pooling resources as needed to solve specific problems using the principle of scale-out design
How can these be applied to the network? The first step is to consider where software and hardware can be decoupled, which drives the remaining steps. Returning to a sketch of how a router is built can be helpful here; Figure 25-11 illustrates.
Note
The kind of diagram shown in Figure 25-11 is notoriously difficult to draw, simply because there are so many different ways of building software. What is shown here is one possible representation just to illustrate the various pieces required to build a router (or other network device).
In Figure 25-11:
• The forwarding Application-Specific Integrated Circuit (ASIC), fans/LEDs/etc., and PHY (physical network interface chipset) are the only hardware devices shown; the remainder are software components.
• The software components are assumed to run on some local processor, memory, and storage resources; these are not normally shown when considering the architecture of a network device.
• The routing stack consists of two components: the actual routing protocol (or other control plane) applications and the Routing Information Base (RIB).
• The kernel primarily manages processes, including memory and processor usage; the kernel may also provide a communication channel between some pairs of components.
• There may be no, one, or two data busses in the system. If this component exists, it is responsible for providing a standard way of carrying information between the other components in the system, and potentially acting as a data store for system state. The data bus can be implemented as a database or a publish/subscribe system.
• There may or may not be a configuration database. If it exists, the configuration database is responsible for holding configuration state for all the other systems on the device.
• The configuration system provides some way to read and write the configuration of the device. Generally, this will include both a machine-readable (an API) and human-readable interface (a command-line interface [CLI]). The machine-readable interface will be considered more fully in Chapter 26, “The Case for Network Automation.”
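To make the data bus component concrete, here is a minimal publish/subscribe sketch. The topic names and the idea of a routing stack publishing route changes for a forwarding component to consume are illustrative assumptions, not a description of any particular NOS:

```python
from collections import defaultdict

class DataBus:
    """Minimal publish/subscribe data bus of the kind described above.

    Components (RIB, configuration system, HAL, ...) register interest
    in topics and receive state changes; the bus also doubles as a
    data store for the last published state, as the text suggests.
    """
    def __init__(self):
        self.subscribers = defaultdict(list)
        self.state = {}

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, value):
        self.state[topic] = value          # retain state for late readers
        for callback in self.subscribers[topic]:
            callback(value)                # notify interested components

bus = DataBus()
fib_updates = []
# A hypothetical forwarding component subscribes to RIB changes:
bus.subscribe("rib/route", fib_updates.append)
# The routing stack publishes a new route onto the bus:
bus.publish("rib/route", {"prefix": "2001:db8::/32", "nexthop": "fe80::1"})
assert fib_updates[0]["prefix"] == "2001:db8::/32"
```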
There is a single term, the Network Operating System (NOS), that is often used to describe either
• All of the software components
• The kernel, data bus, and (sometimes) other components, such as the HAL and PAL
Because the meaning of NOS is variable, you need to make certain you understand precisely which components are being included when the term is used, and which are not.
Given this set of components, the interesting disaggregation question becomes: which components can, or should, be split off and developed, owned, etc., by different people? There are several different logical places to place such a divide:
• Between the routing protocols and the RIB
• Between the HAL and the rest of the components
• Between the PAL and the rest of the components
• Between the hardware and the software
• Between the RIB and the data bus
• Between the configuration system and the configuration database
• Between the configuration database and the data bus
Traditionally, all of these components are purchased as a single item—an appliance. To disaggregate the network, you want to be able to break this appliance apart into multiple pieces. It is possible, in a disaggregated model, for an operator to
• Purchase the hardware, HAL, and PAL from one vendor, build his own control plane to run on top of an open source or vendor-provided RIB, and purchase the remainder of the system, which might be called the NOS, from another vendor
• Purchase the hardware from one vendor and the software from another (where the entire software piece may be called the NOS)
• Purchase everything except the system configuration system from a single vendor and build his own configuration system
The key point is the operator must choose which pieces of the network he wants to own, which he wants to purchase, and which disaggregation model makes the most sense for his business. The specific model chosen is going to depend on the business drivers and requirements in a particular environment. Some specific cost advantages that can be realized by disaggregating hardware from software in the network include
• Commoditizing hardware by separating it from the software. If a suite of software can be used across multiple hardware platforms, the hardware capabilities and cost become the driving factors, rather than the brand on the outside and the software bundled with the hardware. This is the primary goal of the white box movement among network operators.
• Providing operational stability through many generations of hardware. If the software and hardware can be replaced or modified separately, then the hardware can be replaced with newer, more capable devices without modifying operational processes and cadence. At the same time, the software can be modified over time without replacing the hardware, allowing for the network to grow and mature without resorting to using a forklift to replace all the equipment at once.
Like disaggregation on the compute and applications front, disaggregation in the network space goes far beyond cost savings. Decoupling the software from the hardware allows the software to be built specifically around the application architecture—remember that disaggregated applications treat the entire network as a single “thing.” Just as building a high-performance computer requires tuning and adjusting the hardware to support the specific computing task at hand, building a high-performance distributed application often requires building the network as a platform, tuned to the application to increase performance and to focus the operator’s time on areas with higher returns on investment.
Note
Network engineers tend to think about adding features, rather than removing them, as a method of tuning for optimal performance. When you build a race car, though, you do not start by adding a bigger engine; you start by removing the weight of unnecessary things. This simplifies the problem set, reduces the number of components needed, and generally makes replacing or refitting the remaining parts a lot simpler, as well as making the car itself easier to maintain. It is critical for network engineers to get into the habit of thinking about what can be removed, as well as what can be added.
The result is a gain on two fronts. On one side, hardware is commoditized, driving the cost down. On the other side, the software is customized, providing greater value and allowing the software to move at the pace of the business. Even in a fully supported disaggregated environment (such environments are available at the time of this writing), it is possible to disconnect the software life cycle from the hardware life cycle. This allows hardware to be replaced on a much faster schedule to gain new speeds and feeds and new switching features, while keeping software in place for a longer cycle, allowing business processes to adjust and work around the software.
The network world is changing rapidly, and disaggregation has played a major role in driving these changes. Will the network, itself, eventually be disaggregated in the same way and for the same reasons? The final chapter, Chapter 30, “Looking Forward,” will take a look at the future of networking and try to answer these questions.
Disaggregation in the compute and application space have driven many more changes in the world of IT and in network management. For instance, another form of disaggregation in the network is to divide the services offered by the network itself from the network appliance.
Churchill, Elizabeth F. “Patchwork Living, Rubber Duck Debugging, and the Chaos Monkey.” Interactions 22, no. 3 (April 2015): 22–23. doi:10.1145/2752126.
“Engineered Elephant Flows for Boosting Application Performance in Large-Scale CLOS Networks.” Irvine, CA: Broadcom, 2014. https://docs.broadcom.com/docs/1211168569445?eula=true.
Gill, Phillipa, Navendu Jain, and Nachi Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. ACM, 2011. https://www.microsoft.com/en-us/research/publication/understanding-network-failures-data-centers-measurement-analysis-implications/.
Lapukhov, Petr. “Routing Design for Large Scale Data Centers.” British Columbia, Canada, June 3, 2012. https://www.nanog.org/meetings/nanog55/presentations/Monday/Lapukhov.pdf.
Lapukhov, Petr, Ariff Premji, and Jon Mitchell. Use of BGP for Routing in Large-Scale Data Centers. Request for Comments 7938. RFC Editor, 2016. doi:10.17487/RFC7938.
Casado, Martin, and Justin Pettit. “Of Mice and Elephants.” Network Heresy, November 1, 2013. https://networkheresy.com/2013/11/01/of-mice-and-elephants/.
Pepelnjak, Ivan. Data Center Design Case Studies. ipspace.net, 2014. http://www.ipspace.net/Data_Center_Design_Case_Studies.
Roy, Arjun, Hongyi Zeng, Jasmeet Bagga, and Alex C. Snoeren. “Passive Realtime Datacenter Fault Detection and Localization.” In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 595–612. Boston, MA: USENIX Association, 2017. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/roy.
Singh, Arjun, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, et al. “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network.” In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, 183–197. SIGCOMM ’15. New York, NY: ACM, 2015. doi:10.1145/2785956.2787508.
White, Russ. “The State of Open Source Routers.” Presented at the North American Network Operators Group, Bellevue, WA, June 7, 2017. https://www.youtube.com/watch?v=JTQqmnVRToI.
White, Russ, and Denise Donohue. The Art of Network Architecture: Business-Driven Design. 1st edition. Indianapolis, IN: Cisco Press, 2014.
1. Research the difference between a virtual machine and a container. Provide a short list of three or four differences between the two.
2. Research the toroid fabric design. How does it differ from the more widely used spine and leaf design? What would be the impact of a toroid design on oversubscription?
3. Explain the differences between a “normal” network and a fabric.
4. The problem of blocking is not truly removed in nonblocking designs but rather moved. Where is it moved to?
5. Explain the difference between elephant and mouse flows.
6. Find two hardware abstraction layers available for network operating systems. Note four differences between these two abstraction layers.
7. Find two open source routing protocol stacks. What protocols are supported? How much support does each of these projects appear to receive?
A typical network consists of a collection of distributed nodes, each running an operating system, each configured with protocols and feature sets. Networkwide features—for example, a routing protocol—require synchronized configurations to enable nodes to work together. The number of nodes, in addition to the distributed configurations, leads to complexity. New nodes or new features increase the complexity of the system, and increased complexity increases the opportunity for failure, increases operational cost, and retards the network’s ability to change. These monetary and nonmonetary costs often restrict network engineers from adopting new features, using the network to solve business problems, or understanding network failures.
Network automation can lead to better deployment, operation, and troubleshooting of the network. It can reduce the complexity of network deployment, configurations, and operations; increase agility, or the ability of the operator to reshape the network to new requirements more quickly; and reduce cost and risk by removing the human element and using automation tools to interact with individual network devices.
Network automation can range from something as simple as automatically provisioning new switches in the data center, to changing configurations through software, to automating responses to Syslog events. More robust implementations enable network teams to stop thinking about the network as a collection of individual network devices and start thinking about the network as a system.
Network automation has spawned a new title within network teams: the automation engineer. Automation engineers, usually part of the operations team, require advanced network, protocol, and troubleshooting skills; proficiency in a scripting language such as Python or Bash; and the ability to manipulate text, such as using regular expressions. Network automation teams are usually very small.
Note
Regular expressions are a way to match on text strings within a larger text file (such as a network device configuration, or even a book), either to find those strings, or to use a text processor to replace one string with another. More information about the formatting and uses of regular expressions can be found in the “Further Reading” section at the end of the chapter.
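As a brief illustration of this kind of text manipulation, the following Python sketch uses the standard `re` module against a made-up configuration fragment; the hostnames and addresses are invented:

```python
import re

config = """\
hostname edge-router-1
interface GigabitEthernet0/0
 description uplink to core
 ip address 192.0.2.1 255.255.255.0
"""

# Find every IPv4-shaped string in the configuration. Note the pattern
# is deliberately simplified; it would also match invalid octets.
addresses = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", config)
assert addresses == ["192.0.2.1", "255.255.255.0"]

# Replace one string with another, as a text processor might:
updated = re.sub(r"(?m)^ description .*$",
                 " description uplink to spine", config)
assert " description uplink to spine" in updated
```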
To automate a network device, a network automation tool requires some method to connect, authenticate, and interact with the management plane of a network node. Traditionally, most if not all network devices feature a command-line interface (CLI). The CLI provides access to the management plane over Telnet or a Secure Shell (SSH), creating what’s well known as a human-to-machine system. While CLI interfaces are optimized for humans, they can be automated using tools such as Expect, Puppet, Ansible, Chef, Salt, and CFEngine (see the “Further Reading” section at the end of the chapter for links to information about these tools).
Taking one of these tools as an example: Expect is a scripting language to automate configurations through interactive interfaces—for example, CLI interaction running over SSH. Expect creates a machine-to-human-to-machine system, normally using the CLI as an API (so Expect can be leveraged for any system with a CLI). Expect scripts run a set of commands after the SSH session returns some text. For example, an Expect script to log in may entail the following:
# spawn an SSH session first (the hostname is illustrative)
spawn ssh router1
expect "Username: "
send "Groot\r"
expect "Password: "
send "cisco123\r"
The Expect processor examines the text stream in real time, looking for the “Username: ” prompt (generally by using some form of regular expression matching engine on the incoming text stream). When this prompt is encountered, the processor sends a text string in reply containing “Groot” based on a previously written script. The same pattern is followed for the password. In the case of a Cisco CLI, Expect may just respond to a command prompt at a particular level, such as the enable prompt. Expect is very extensible and can automate any CLI-based feature on a single device or an entire network. In most cases a network automation administrator will use Expect with text parsing and manipulation tools (such as regular expressions) to automate the configuration of a large number of devices, hence managing the entire network with a small number of scripts.
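The “small number of scripts” approach usually starts from a template. The following Python sketch, with an invented template and device inventory, shows how per-device command sets might be generated before being fed to a tool such as Expect:

```python
# Render per-device command sets from one template and an inventory.
# The device names, interfaces, and addresses are all hypothetical.
template = [
    "hostname {name}",
    "interface {uplink}",
    " ip address {addr} 255.255.255.252",
]

inventory = [
    {"name": "tor-1", "uplink": "Ethernet1", "addr": "192.0.2.1"},
    {"name": "tor-2", "uplink": "Ethernet1", "addr": "192.0.2.5"},
]

# One rendered command list per device; a driver script would then
# deliver each list over SSH (for example, via Expect).
configs = {
    device["name"]: [line.format(**device) for line in template]
    for device in inventory
}
assert configs["tor-2"][0] == "hostname tor-2"
```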
While CLI scripting tools have proven to be very successful and are still in use in modern networks, they are very difficult to build, maintain, and troubleshoot. A common issue with Expect is dealing with what happens when something unexpected happens. If a vendor changes the output of a particular command or prompt, or the order in which commands need to be entered, the change will need to be discovered and the affected scripts modified for the new input/output pattern. For example, a vendor might change “Username” to “username,” or even ask for the password first. In these cases, the script will simply not run or throw an error. Additionally, because each vendor has slight variations of CLI, scripting work must be duplicated in multivendor or multinetwork operating system environments. Finally, Expect does not have any implicit understanding of configuration state; thus this logic must be written in the script.
One of the first tools to emerge to better manage and automate networks was the Simple Network Management Protocol (SNMP). SNMP enables network operators to securely connect to a device and use a common (standardized) or vendor-specific Management Information Base (MIB) to interact with it. SNMP was originally designed for both monitoring and configuration management; however, using SNMP for configuration management has seen extremely low adoption because it is very difficult to use and usually does not reflect all the capabilities of a node.
SNMP, in dictionary terms, stores the metadata in an MIB specification, while the corresponding state is stored in the information retrieved from the device. In order to retrieve a particular piece of information, the information requested must be specified according to the dictionary rules; for example, a request might look like
snmpget -m ../../mibs/RFC1213-MIB localhost .iso.org.dod.
internet.mgmt.mib-2.system.sysDescr.0
In order to retrieve information about an entire subsystem, the entire MIB table must be “walked,” item to item, with each item being returned separately. The separate items must then be reassembled into their proper form and then interpreted based on the MIB definition to understand what the actual state of the device is. A new method to automate networks was required.
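The walk-and-reassemble process can be sketched in a few lines of Python. The OID table below is a hand-built stand-in for a device's interface MIB (ifDescr and ifOperStatus columns), not output from a real SNMP agent:

```python
# Simulate "walking" a MIB table: each GETNEXT-style step returns one
# (OID, value) pair, and the caller must reassemble rows afterward.
table = {
    "1.3.6.1.2.1.2.2.1.2.1": "eth0",   # ifDescr, interface index 1
    "1.3.6.1.2.1.2.2.1.2.2": "eth1",   # ifDescr, interface index 2
    "1.3.6.1.2.1.2.2.1.8.1": 1,        # ifOperStatus, index 1 (up)
    "1.3.6.1.2.1.2.2.1.8.2": 2,        # ifOperStatus, index 2 (down)
}

def walk(prefix):
    # One item per request, in OID order, as an agent would return them.
    for oid in sorted(table):
        if oid.startswith(prefix):
            yield oid, table[oid]

# Reassemble per-interface rows from the flat, item-by-item results:
rows = {}
for oid, value in walk("1.3.6.1.2.1.2.2.1."):
    column, index = oid.rsplit(".", 2)[-2:]
    name = "ifDescr" if column == "2" else "ifOperStatus"
    rows.setdefault(index, {})[name] = value

assert rows["1"] == {"ifDescr": "eth0", "ifOperStatus": 1}
```

Only after this reassembly, and after interpreting the values against the MIB definition, does the operator see the actual state of the device — which is exactly the friction the text describes.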
As networks became bigger and more important to application delivery, network operators sought better ways to manage and automate those networks. In 2002 a small group of network engineers held an Internet Engineering Task Force (IETF) workshop to discuss the future of network management and build a high-level architecture for future protocol developments. The results of this workshop are documented in RFC3535, Overview of the 2002 IAB Network Management Workshop.1 RFC3535 discusses current network management technologies, including SNMP and CLI, and more importantly describes 14 requirements for network management and automation protocols. These requirements are
1. Ease of use is a key requirement for any network management technology from the operator’s point of view.
2. It is necessary to make a clear distinction between configuration data and data describing operational state and statistics. Some devices make it very hard to determine which parameters were administratively configured and which were obtained via other mechanisms such as routing protocols.
3. It is required to be able to fetch separately configuration data, operational state data, and statistics from devices, and to be able to compare these between devices.
4. It is necessary to enable operators to concentrate on the configuration of the network as a whole rather than individual devices.
5. Support for configuration transactions across a number of devices would significantly simplify network configuration management.
6. Given configuration A and configuration B, it should be possible to generate the operations necessary to get from A to B with minimal state changes and effects on network and systems. It is important to minimize the impact caused by configuration changes.
7. A mechanism to dump and restore configurations is a primitive operation needed by operators. Standards for pulling and pushing configurations from and to devices are desirable.
8. It must be easy to do consistency checks of configurations over time and between the ends of a link in order to determine the changes between two configurations and whether those configurations are consistent.
9. Networkwide configurations are typically stored in central master databases and transformed into formats that can be pushed to devices, either by generating sequences of CLI commands or by pushing complete configuration files to devices. There is no common database schema for network configuration, although the models used by various operators are probably very similar. It is desirable to extract, document, and standardize the common parts of these networkwide configuration database schemas.
10. It is highly desirable for text processing tools such as diff and version management tools such as RCS or CVS to be usable for processing configurations, which implies devices should not arbitrarily reorder data such as access control lists.
11. The granularity of access control needed on management interfaces needs to match operational needs. Typical requirements are a role-based access control model and the principle of least privilege, where a user can be given the minimum access necessary to perform a required task.
12. It must be possible to do consistency checks of access control lists across devices.
13. It is important to distinguish between the distribution of configurations and the activation of a certain configuration. Devices should be able to hold multiple configurations.
14. SNMP access control is data oriented, while CLI access control is usually command (task) oriented. Depending on the management function, sometimes data-oriented or task-oriented access control makes more sense. As such, it is a requirement to support both data-oriented and task-oriented access control.
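Requirements 6 and 8 above (generating the operations needed to move from configuration A to configuration B, and checking configurations for consistency over time) can be sketched with Python's standard difflib module. The configurations below are invented for illustration:

```python
import difflib

# Two hypothetical device configurations, as ordered lines of text.
config_a = [
    "hostname edge-1",
    "interface Ethernet1",
    " ip address 10.0.0.1/31",
    "router bgp 65000",
]
config_b = [
    "hostname edge-1",
    "interface Ethernet1",
    " ip address 10.0.0.5/31",
    "router bgp 65000",
    " neighbor 10.0.0.4 remote-as 65001",
]

def config_delta(a, b):
    """Return only the lines that must change to move from config a to b."""
    diff = difflib.unified_diff(a, b, lineterm="", n=0)
    # Skip the ---/+++ headers and @@ hunk markers; keep only change lines.
    return [line for line in diff
            if line[:1] in "+-" and line[:3] not in ("+++", "---")]

delta = config_delta(config_a, config_b)
print(delta)
```

A minimal change set like this is exactly what requirement 6 asks for: the operator applies only the delta, rather than re-pushing the entire configuration.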
In response to RFC3535, modern automation protocols were developed. These protocols enable better multivendor network management and automation by enabling machine-to-machine interfaces through open standards and methods.
NETCONF was developed by the IETF in response to RFC3535; it is an open standard protocol enabling device configuration and monitoring in either single or multivendor networks. NETCONF works in a client/server model where the server is the network node, and the client is a standalone network management station. The management station provides holistic management of the network and supports networkwide automation, allowing network administrators to address the network as a single entity.
NETCONF features multiple configuration data stores to closely mirror the operational state of a network device, as shown in Table 26-1.
Table 26-1 NETCONF Data Stores
Data Store | Purpose
<candidate> | Working copy of the configuration for validation and testing
<running> | The configuration the device is currently using
<startup> | The configuration the device will run when booted
Table 26-1 shows three data stores (or tables), which represent a standard process for updating configurations on network devices:
• The <running> data store is the configuration currently being used, or run on the device.
• The <startup> configuration is what will be run the next time the device boots.
• Candidate configurations are stored in the <candidate> data store and can be manipulated without impacting the running configuration. When a network operator is finished with a candidate configuration, the configuration can be validated for proper syntax, against a set of rules to ensure no configuration items have been missed, or sent through a dry run. For instance, a validator might check to make certain an interface configuration always includes IPv4 and IPv6 addresses, or a routing protocol is configured for both IPv4 and IPv6, or just IPv6. This kind of validation can catch and prevent simple mistakes.
If the configuration is acceptable, it is then committed, or pushed into running configuration. Candidate configurations are often leveraged to schedule commits during outage or change windows. This allows changes to be written and tested before the change window. Because some devices do not support a candidate configuration, NETCONF features a capability exchange with initial HELLO messages. NETCONF configurations are atomic: if any part of the configuration fails or has unexpected results, the entire configuration can be rolled back.
If the running configuration proves to be correct, then it can be committed to the <startup> data store, so the device will boot with this configuration the next time it is restarted. It is also possible for the device to boot with a very simple configuration, which is then modified by a network management station through the <candidate> and <running> data stores.
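The candidate/running/startup workflow just described can be sketched as a toy model in Python. The Device class, its property names, and the validation rule (an interface must carry an IPv6 address, echoing the validator example earlier) are all invented for this sketch; real NETCONF servers implement these semantics on the device:

```python
import copy

class Device:
    """Toy model of the three NETCONF datastores from Table 26-1."""

    def __init__(self, running):
        self.running = running
        self.candidate = copy.deepcopy(running)
        self.startup = copy.deepcopy(running)

    def edit_candidate(self, key, value):
        """Stage a change without touching the running configuration."""
        self.candidate[key] = value

    def validate(self):
        """Trivial rule: every interface must have an IPv6 address."""
        iface = self.candidate.get("interface Ethernet1", {})
        return "ipv6" in iface

    def commit(self):
        """Atomically promote candidate to running: all or nothing."""
        if not self.validate():
            raise ValueError("candidate configuration failed validation")
        self.running = copy.deepcopy(self.candidate)

    def save_startup(self):
        """Persist the running configuration for the next boot."""
        self.startup = copy.deepcopy(self.running)

dev = Device({"interface Ethernet1": {"ipv4": "192.0.2.1/31"}})
dev.edit_candidate("interface Ethernet1",
                   {"ipv4": "192.0.2.1/31", "ipv6": "2001:db8::1/127"})
dev.commit()
dev.save_startup()
print(dev.running["interface Ethernet1"]["ipv6"])
```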
A key component of NETCONF is the management station. The management station provides a networkwide viewpoint for network management and automation. It will have a graphical user interface (GUI) or CLI, enabling network administrators to focus on the deployment automation of an entire networkwide service. A sample service may be provisioning a new VPN customer or changing an SNMP password. Network administrators can use a management station to manage the entire lifecycle of a service. By default, and because of NETCONF, network management stations support multivendor networks. To provide additional extensibility, some network management stations feature other southbound configuration methods or protocols—for example, SNMP or CLI—and robust northbound APIs for integration into other systems.
NETCONF is a modular protocol, organized in layers, as shown in Figure 26-1.
These layers allow for other tools or protocols to be inserted to extend functionality.
The bottom layer is concerned with the transport of messages between devices. NETCONF supports many different transport protocols; however, SSH is commonly used because it is well known and provides authentication, integrity, and confidentiality. Because SSH runs over the Transmission Control Protocol (TCP), it also provides reliable transport.
The message layer frames and encodes remote procedure calls (RPCs). An RPC appears to be a local function (or procedure) call to the calling application, but is actually executed on a remote device (see the “Further Reading” section at the end of the chapter for more information on RPCs). NETCONF’s use of RPC enables NETCONF to instruct the remote device what to do with the command—for example, apply the configuration detailed in the operations layer. NETCONF RPC messages are encoded in the eXtensible Markup Language (XML) and must contain a message-id element allowing NETCONF to track messages. Finally, the messages layer supports notifications, where devices notify the management station of a configuration change.
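The message-layer framing can be illustrated with Python's standard library: the sketch below wraps an operation in an <rpc> element carrying the required message-id attribute. The helper name is invented; production code would use a NETCONF client library rather than building XML by hand:

```python
import xml.etree.ElementTree as ET

# The NETCONF base namespace, per RFC 6241.
NC = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_rpc(message_id, operation):
    """Frame a NETCONF operation in an <rpc> element with a message-id,
    as the messages layer requires."""
    rpc = ET.Element("{%s}rpc" % NC, {"message-id": str(message_id)})
    ET.SubElement(rpc, "{%s}%s" % (NC, operation))
    return ET.tostring(rpc, encoding="unicode")

msg = build_rpc(101, "get-config")
print(msg)
```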
The operations layer defines the actions for NETCONF clients and servers. NETCONF operations are a set of create, read, update, and delete (CRUD) actions used on the data stores. Common operations include get-config, edit-config, and delete-config.
The base protocol includes the operations shown in Table 26-2.
Table 26-2 NETCONF Operations

NETCONF Operation | Description
get | Retrieve running configuration and device state information
get-config | Retrieve all or part of a specified configuration datastore
edit-config | Load all or part of a specified configuration to the specified target configuration datastore
copy-config | Create or replace an entire configuration datastore with the contents of another complete configuration datastore
delete-config | Delete a configuration datastore
lock | Lock the entire configuration datastore system of a device
unlock | Release a configuration lock
close-session | Gracefully terminate the NETCONF session
kill-session | Force the termination of a NETCONF session
The content layer contains the formatted data, either configuration or notification, sent to or from the network node. The NETCONF specification does not define how the data should be formatted; it does, however, suggest the use of the YANG data modeling language. Data modeling ensures compatibility between systems.
YANG is a data modeling language defined in RFC6020 and updated in RFC7950; it is used to format configuration, notification, and state data in the operations and content layers of NETCONF. Data modeling is a definition of both the syntax and the semantics or schema of the data and is critical when working between remote systems. The data model ensures the requests of the NETCONF management station are faithfully carried out on the NETCONF server (network device).
There are two sets of widely deployed and supported YANG models:
• The IETF standardizes a data model in YANG for each protocol, for basic routing functionality, and for common equipment management requirements.
• The OpenConfig group maintains another set of data models, largely overlapping, and often coordinated with the IETF data models.
Beyond these models, each vendor also supports a vendor- and equipment-specific model set that can (often) be downloaded from their support sites.
Data modeled with YANG is carried as XML in a keyed hierarchical model, much like a Type Length Value (TLV) format. The hierarchy of the model enables multiple organized levels of parent/child relationships of key/value paired data. Keys in YANG must be unique within a given level of the hierarchy, whether they hold single values or lists of data. YANG data is typed (for example, integer or string), and the types are enforced by the server and client implementations. Because the data is carried as XML, it is generally human readable. The following code snippet is an example of YANG-modeled data requesting "show interface brief":
01 <?xml version="1.0"?>
02 <nf:rpc xmlns:nf="urn:ietf:params:xml:ns:netconf:base:1.0"
xmlns="http://www.cisco.com/nxos:7.0.3.I6.1.:if_manager"
message-id="1">
03 <nf:get>
04 <nf:filter type="subtree">
05 <show>
06 <interface>
07 <brief/>
08 </interface>
09 </show>
10 </nf:filter>
11 </nf:get>
12 </nf:rpc>
13 ]]>]]>
In the code snippet, line 1 declares the document as XML. Line 2 is the RPC call from the standard NETCONF library. Line 3 is the NETCONF operation. Lines 5 through 7 are the NETCONF content and the command to show the interface information.
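The same snippet can be parsed with Python's standard library to recover the command hierarchy. Note that the trailing ]]>]]> on line 13 is the NETCONF 1.0 end-of-message marker used for framing over SSH; it is not part of the XML document and must be stripped before parsing:

```python
import xml.etree.ElementTree as ET

# The XML body of the request shown above, without the ]]>]]> framing marker.
doc = """<?xml version="1.0"?>
<nf:rpc xmlns:nf="urn:ietf:params:xml:ns:netconf:base:1.0"
        xmlns="http://www.cisco.com/nxos:7.0.3.I6.1.:if_manager"
        message-id="1">
  <nf:get>
    <nf:filter type="subtree">
      <show><interface><brief/></interface></show>
    </nf:filter>
  </nf:get>
</nf:rpc>"""

root = ET.fromstring(doc)
# Element tags carry their namespace in braces; strip it to recover the words.
path = [el.tag.split("}")[-1] for el in root.iter()]
print(path)
```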
Note
RFC7951 defines JavaScript Object Notation (JSON) encoding of data modeled with YANG. Both XML and JSON are discussed later in this chapter.
Together NETCONF and YANG provide an open-standards-based, easy-to-use toolset to automate a network. Commercial tools (for example, Cisco’s Tail-F) leverage NETCONF/YANG and enable network administrators to manage the network in terms of the service the network provides—for example, Quality of Service (QoS) or Virtual Private Networks (VPNs).
RESTCONF is an emerging extension of NETCONF that leverages the widely deployed Hypertext Transfer Protocol (HTTP) over the Secure Sockets Layer (that is, HTTPS) to interact with network devices. RESTCONF uses YANG as a data modeling language and has the same basic functionality as NETCONF; however, it uses HTTP methods such as POST, PUT, and DELETE to implement the equivalent of NETCONF operations. The RESTful methods available in RESTCONF enable basic create, read, update, and delete (CRUD) operations on a hierarchy of data stores via HTTP.
RESTCONF carries "REST" in its name because it provides a RESTful interface; the concept of Representational State Transfer (REST) and RESTful interfaces is discussed further in the next section.
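As a sketch of what a RESTCONF retrieval looks like from Python, the snippet below builds the request URL and headers. The device address and data path are hypothetical; the /restconf/data URL shape and the application/yang-data+json media type follow RFC 8040, and the actual resource paths depend on the YANG models a given device supports:

```python
# Hypothetical RESTCONF endpoint of a managed device.
BASE = "https://198.51.100.10/restconf"

def restconf_request(path):
    """Build the URL and headers for a RESTCONF GET of a YANG data resource."""
    url = "%s/data/%s" % (BASE, path)
    headers = {"Accept": "application/yang-data+json"}
    return url, headers

url, headers = restconf_request("ietf-interfaces:interfaces")
print(url)

# With a reachable device, the call might be issued with the third-party
# "requests" library, for example:
#   import requests
#   resp = requests.get(url, headers=headers, auth=("admin", "secret"))
#   print(resp.status_code, resp.json())
```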
Some modern network operating systems, for example, Cisco NX-OS, offer additional vendor-specific automation programmatic interfaces. These interfaces, known as Application Programming Interfaces, or APIs, are exposed on standard TCP ports and require a higher-level language such as Python to interact with a device. The higher-level program contains the specific interaction—for example, a configuration or troubleshooting routine—and leverages the API to connect to the device. Most network operating systems require the API to be enabled via CLI and require authentication.
Network APIs are not created or maintained under the guidance of a standards body, and each vendor or even different platforms from the same vendor will have unique methods and procedures to access and use the APIs. Documentation for a network API is found on the vendor’s website.
Network automation with APIs offers higher-level applications the capability to interact with a network device. Higher-level programs introduce logic such as IF…THEN…ELSE or on-demand configuration changes that tie the network into business-level systems or procedures. Custom applications with APIs are extremely flexible; however, they often come with a cost in complexity and supportability. Each time a vendor changes a custom API, some amount of work must be performed to adapt applications using those APIs.
A popular way to use network APIs is to connect the network to higher-level automation tools in private cloud environments. Private cloud environments enable automatic provisioning and teardown of the network, storage, and compute to reduce costs and increase the agility of data center infrastructure. The combination of automated network, storage, and compute is commonly referred to as infrastructure as a service and is the starting point for other cloud services, for example, platform as a service (PaaS) and software as a service (SaaS).
Most modern APIs provide RESTful interfaces, or rather adhere to REST. Beyond the transport specifications, the most interesting aspect of REST as an API is that no state is maintained at the server. This means a RESTful operation must complete in a single call and return, from the network device's perspective. For instance, say a RESTCONF client needs to configure three static routes on a network device:
1. 2001:db8:3e8:100::1 via 2001:db8:3e8:110::1
2. 2001:db8:3e8:110::1 via 2001:db8:3e8:120::1
3. 2001:db8:3e8:120::1 via 2001:db8:3e8:130::1
If it installs all three routes at once, the client must handle any dependencies, such as a failure to install the route to 2001:db8:3e8:120::1, on which the route to 110::1 depends. The server simply does not keep any state about the interaction between the commands executed, and hence has no way to roll back or otherwise modify commands transmitted serially through a RESTful interface.
REST was originally designed as a paradigm for the HTTP protocol, and hence RESTful APIs are most often implemented over HTTP using common HTTP verbs, such as GET, PUT, and DELETE. A RESTful client connects to the interface and sends formatted data, for example, a configuration or show command, and the device responds with an HTTP code and, optionally, formatted data. The returned HTTP code is a standard response (for example, 200 = OK, 401 = Unauthorized) informing the client whether the command was successful. Figure 26-2 illustrates this exchange.
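The client-side handling of those standard response classes can be sketched as a small helper; the classification follows the standard HTTP status-code ranges:

```python
def classify_status(code):
    """Interpret the standard HTTP response classes a RESTful API returns."""
    if 200 <= code < 300:
        return "success"        # e.g., 200 OK
    if 400 <= code < 500:
        return "client error"   # e.g., 401 Unauthorized
    if 500 <= code < 600:
        return "server error"   # e.g., 500 Internal Server Error
    return "other"              # informational or redirect classes

for code in (200, 401, 500):
    print(code, classify_status(code))
```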
The data sent and received in a RESTful connection requires some structured formatting. Common data formats include XML, JSON, and YAML (originally standing for Yet Another Markup Language, but later changed to YAML Ain’t Markup Language, a recursive acronym, like GNU, which means GNU’s Not UNIX).
The following snippets illustrate three markup systems often used to format data in a RESTful interface. XML is the first example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<Beatles>
<Revolver>
<Songs>
<element>Taxman</element>
<element>Eleanor Rigby</element>
<element>I'm Only Sleeping</element>
<element>Love You Madeline</element>
<element>Here, There and Everywhere</element>
<element>Ellie Said She Said</element>
<element>Good Day Sunshine</element>
<element>And Your Bird Can Sing</element>
<element>For No One</element>
<element>Doctor Alex</element>
<element>I Want to Tell You</element>
<element>Got to Get You Into My Life</element>
<element>Tomorrow Never Knows</element>
</Songs>
</Revolver>
</Beatles>
</root>
XML is a markup language that encodes information between descriptive tags. (XML is closely related to the Hypertext Markup Language, or HTML; both are derived from SGML, and HTML was originally designed to describe the formatting of web pages served through HTTP.) The encoded information is defined within user-defined schemas that enable any data to be transmitted between systems. In the case of network automation, XML-encoded data may be a single command or an entire configuration. The entire XML document is stored as text, making it both machine and human readable.
YAML is the second example:
Beatles:
Revolver:
Songs:
- Taxman
- Eleanor Rigby
- I'm Only Sleeping
- Love You Madeline
- Here, There and Everywhere
- Ellie Said She Said
- Good Day Sunshine
- And Your Bird Can Sing
- For No One
- Doctor Alex
- I Want to Tell You
- Got to Get You Into My Life
- Tomorrow Never Knows
YAML, of which JSON can be considered a subset, is designed to be very human readable. Like JSON, YAML is structured in key/value pairs, and it allows for user-defined white space. The extra white space improves the readability of YAML documents but can make them resource intensive to parse.
JSON is the third and final example:
{
"Beatles": {
"Revolver": {
"Songs": [
"Taxman",
"Eleanor Rigby",
"I'm Only Sleeping",
"Love You Madeline",
"Here, There and Everywhere",
"Ellie Said She Said",
"Good Day Sunshine",
"And Your Bird Can Sing",
"For No One",
"Doctor Alex",
"I Want to Tell You",
"Got to Get You Into My Life",
"Tomorrow Never Knows"
]
}
}
}
A more recent alternative to XML is JSON. JSON is defined in RFC4627 and encodes information in structured key/value pairs. The keys in a JSON document are predefined tags understood between systems; each key has an associated value. For example, a key of "Command" may have a value of "show running configuration." A value may also be a list, represented with open and close brackets, for example, "[data1, data2, data3]." Similar to XML, JSON is stored as a text file and is both human and machine readable; however, JSON is much easier for humans to interact with. The main advantage of JSON is that it is straightforward to parse, because each key directly references its values.
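That parsing simplicity is easy to demonstrate: the JSON document shown above loads directly into native Python structures with the standard json module, and the keys become ordinary dictionary lookups:

```python
import json

# The JSON example from above, as a string.
doc = '''{
  "Beatles": {
    "Revolver": {
      "Songs": ["Taxman", "Eleanor Rigby", "I'm Only Sleeping",
                "Love You Madeline", "Here, There and Everywhere",
                "Ellie Said She Said", "Good Day Sunshine",
                "And Your Bird Can Sing", "For No One", "Doctor Alex",
                "I Want to Tell You", "Got to Get You Into My Life",
                "Tomorrow Never Knows"]
    }
  }
}'''

data = json.loads(doc)               # parse text into dicts and lists
songs = data["Beatles"]["Revolver"]["Songs"]
print(len(songs), songs[0])
```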
REST, XML, JSON, and YAML are all supported in a variety of different programming languages, including C, Java, and Python. Python, mainly because of its ease of use, is the unofficial standard for network programmability and automation projects. The Python language is easy to write, hard to mess up, and is supported on most operating systems. Python supports thousands of libraries that extend the language to support a wide variety of technologies.
APIs, Python, REST, and JSON come together to automate a network or network device via programmability. Most modern network operating systems require the API to be enabled on a particular TCP port and an authentication method to be configured. A separate computer then invokes a Python program to interact with the node. Figure 26-3 illustrates this arrangement.
Many network devices also support on-box automation. On-box automation is a script or procedure running on the management plane of a network device, enabling network operators to automate configurations or events that are local to the network device. Because the on-box automation scripts are distributed with the network device, they are better at handling link failure or isolation-type events. On-box automation tools consist of vendor-specific offerings—for example, Cisco Embedded Event Manager (EEM), Python, or Linux Bash scripts.
Cisco EEM is a popular on-box automation tool for Cisco devices. Cisco EEM features event detectors (for example, an environmental issue or a routing protocol adjacency change) and the ability to tie an action to an event. A common example of EEM is to automate a "shut, no shut" response to a downed interface, or to collect information about processes and memory usage when processor utilization rises above a specific percentage. EEM actions support CLI-based responses or more complex actions with Python or Tcl scripts.
Some network devices such as Cisco Nexus support on-box automation with Python or Bash scripts. Python or Bash scripting is normally available on network devices running Linux as the underlying OS; it allows network administrators to automate network or device functions with the flexibility of Python. A sample on-box script may perform an action or generate an alert on bootup or after someone has logged in with privileged access. On-box Python scripts can simulate or replace features that are not available on a deployed platform.
Infrastructure automation tools are designed to manage and automate operating systems, network devices, or resources. Infrastructure automation tools can be used for network automation; however, these tools are more common in agile software development tool chains, such as DevOps. Infrastructure automation tools will connect and authenticate with a network device and use either the CLI or an API to make changes. They will have a playbook or manifest detailing how to interact with a specific vendor device for a specific feature. Infrastructure automation tools enable the network to be represented as code, known as Infrastructure as Code (IaC). IaC enables agile network configurations because a DevOps team deploys or changes network resources as part of a software rollout. To date, a number of infrastructure automation tools are available, but the open source tool Puppet is most popular.
Note
DevOps is development operations contracted into a single word. The general idea of DevOps is to use development processes to manage the operational tasks of running a network, such as managing configurations and versioning.
The Puppet software package, developed by Puppet Labs, is an open source automation toolset for managing servers and other resources by enforcing device states, such as configuration settings.
Puppet components include a puppet agent that runs on the managed device (node) and a puppet master (server) that typically runs on a separate dedicated server and serves multiple devices. The operation of the puppet agent involves periodically connecting to the puppet master, which in turn compiles and sends a configuration manifest to the agent; the agent reconciles this manifest with the current state of the node and updates state based on differences.
A puppet manifest is a collection of property definitions for setting the state on the device. The details for checking and setting these property states are abstracted, so a manifest can be used for more than one operating system or platform. Manifests are commonly used for defining configuration settings, but they can also be used to install software packages, copy files, and start services.
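The reconciliation step a puppet agent performs can be sketched as a simple comparison of desired state (from the manifest) against current state (from the node), emitting only the changes needed. The property names and values here are invented for illustration:

```python
def reconcile(manifest, current):
    """Return only the properties whose current state differs from the
    desired state declared in the manifest."""
    changes = {}
    for prop, desired in manifest.items():
        if current.get(prop) != desired:
            changes[prop] = desired
    return changes

# Desired state from a (hypothetical) manifest vs. the node's current state.
manifest = {"ntp_server": "192.0.2.50", "snmp_community": "ops-ro"}
current = {"ntp_server": "192.0.2.50", "snmp_community": "public"}
print(reconcile(manifest, current))
```

Applying only the differences, rather than reapplying every property, is what makes state enforcement idempotent: running the agent twice in a row produces no second round of changes.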
A relatively new component in networking is a network controller. Network controllers provide holistic management of a distributed network and a single interface for network automation and programmability. Controllers build an abstraction layer to simplify network management, making network automation easier. An abstracted configuration for a network enables networkwide configurations—for example, setting a new NTP server on a number of devices. In this case, the network operator would simply set the configuration in the controller, and the controller would deal with connecting, authenticating, and ensuring the configuration is set on every device, as illustrated in Figure 26-4.
Some network controllers feature out-of-the-box automation to deploy and manage networks. For example, the Cisco APIC data center controller automates the deployment of VXLAN as well as many other technologies. Additionally, network controllers simplify deployment of network features by automating complexity and providing guided GUI-based configurations.
Deployment automation, also known as zero-touch deployment, automates the deployment of new network nodes. Automated deployment ensures that new additions to the network infrastructure, whether initial deployments or replacements after a failure, are brought up consistently. Deployment automation reduces the time, risk, and expense of deploying new nodes.
Deployment automation technologies require a device to request deployment automation from a deployment server. A device may have a configurable flag to request a configuration at next boot, or the request can be as simple as the absence of a configuration. The network device finds an automation server using information discovered through DHCP or through broadcast technologies. The node then asks the automation tool for a configuration server. The configuration server responds to the request with a templated configuration that may be customized by the automation tool. The final step for a deployment automation tool is to notify the network administrator that a new device has been added.
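The zero-touch sequence above might be sketched like this. The server address, template text, and function names are invented for illustration; real implementations are vendor specific:

```python
# Hypothetical walk-through of the zero-touch steps described above.
# Addresses, the template, and function names are invented; real ZTP
# implementations (and the DHCP options they use) vary by vendor.

def dhcp_discover():
    # Step 1: the unconfigured device learns the automation server via DHCP
    # (for example, a vendor-specific option carrying the server address).
    return {"automation_server": "10.0.0.10"}

def request_config(serial, template="hostname sw-{serial}\nntp server 10.1.1.1"):
    # Steps 2-3: the node asks the automation tool for a configuration
    # server, which returns a templated configuration customized per device.
    return template.format(serial=serial)

def notify_admin(serial):
    # Final step: tell the operator a new device has joined the network.
    return f"device {serial} added"

offer = dhcp_discover()
config = request_config("ABC123")
print(offer["automation_server"], "|", config.splitlines()[0], "|", notify_admin("ABC123"))
```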
Deployment automation tools are available for data center, campus, and wide area network (WAN) environments. To date, there is no standard for deployment automation solutions, and each vendor brings proprietary solutions to market. These solutions are normally a component of larger network management tools, such as Cisco Prime for WAN/campus environments or Cisco Data Center Network Manager (DCNM) for data centers.
The emerging and enormously popular fields of data analytics and machine learning will power the next generation of network automation. Data analytics is a general term for a family of tools that collect data and transform it into organized, insightful information. Much of the data network devices generate today is discarded. This discarded data could give network operators better insight into the status (configurations) and health (logs) of network devices, into network traffic, and into the health of applications traversing the network.
Machine learning enables computers to predict events from data. For example, machine learning may predict a security issue or an expected traffic load. A machine learning system can then change the network configuration based on its predictions, without human intervention. The combination of data analytics, machine learning, and network automation will enable self-provisioning, self-healing networks and the transition to autonomous networks.
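As a toy illustration of this predict-then-act loop, a simple moving average stands in for a real machine-learning model; the threshold and actions are invented:

```python
# Toy predict-then-act loop. A moving average is NOT machine learning --
# it stands in here for a trained model forecasting load. The capacity
# threshold and action strings are invented for illustration.

def predict_load(history, window=3):
    recent = history[-window:]
    return sum(recent) / len(recent)         # naive forecast: recent average

def act(predicted, capacity=80):
    # The automation step: react to the forecast without human intervention.
    if predicted > capacity:
        return "scale-out: start another service instance"
    return "no change"

samples = [40, 55, 70, 85, 95]               # utilization %, rising toward peak
print(act(predict_load(samples)))            # forecast exceeds capacity, so scale out
```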
Other resources on network automation and programmability:
Ansible by Red Hat. “Ansible Is Simple IT Automation.” Accessed September 3, 2017. https://www.ansible.com.
Bjorklund, Martin. The YANG 1.1 Data Modeling Language. Request for Comments 7950. RFC Editor, 2016. doi:10.17487/RFC7950.
———. YANG—A Data Modeling Language for the Network Configuration Protocol (NETCONF). Request for Comments 6020. RFC Editor, 2010. doi:10.17487/RFC6020.
“CFEngine—Automate Large-Scale, Complex and Mission Critical IT Infrastructure with CFEngine.” CFEngine. Accessed September 3, 2017. https://cfengine.com/.
“Chef: Automate Infrastructure and Applications.” Chef. Accessed September 3, 2017. https://www.chef.io/.
“Expect—Expect—Home Page.” Accessed September 3, 2017. http://expect.sourceforge.net/.
“Extensible Markup Language (XML).” Accessed September 3, 2017. https://www.w3.org/XML/.
Goyvaerts, Jan. “Regular Expressions: The Complete Tutorial,” July 2007. https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf.
“gRPC.” Accessed September 3, 2017. https://grpc.io/.
Harrington, David, Bert Wijnen, and Randy Presuhn. An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks. Request for Comments 3411. RFC Editor, 2002. doi:10.17487/RFC3411.
Marshall, A. D. “Remote Procedure Calls.” In Programming in C: UNIX System Calls and Subroutines Using C, 2005. https://users.cs.cf.ac.uk/Dave.Marshall/C/node33.html.
Meyer, Paul, David B. Levi, and Bob Stewart. Simple Network Management Protocol (SNMP) Applications. Request for Comments 3413. RFC Editor, 2002. doi:10.17487/RFC3413.
“Netconf Central.” Accessed September 3, 2017. http://www.netconfcentral.org/.
Petrusha, Ron. “Regular Expression Language—Quick Reference.” Documentation, March 2017. https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference.
Presuhn, Randy. Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP). Request for Comments 3416. RFC Editor, 2002. doi:10.17487/RFC3416.
“Puppet—The Shortest Path to Better Software.” Puppet. Accessed September 3, 2017. https://puppet.com/.
Schönwälder, Jürgen. Overview of the 2002 IAB Network Management Workshop. Request for Comments 3535. RFC Editor, 2003. doi:10.17487/RFC3535.
Thurlow, Robert. RPC: Remote Procedure Call Protocol Specification Version 2. Request for Comments 5531. RFC Editor, 2009. doi:10.17487/RFC5531.
Tischer, Ryan, and Jason Gooley. Programming and Automating Cisco Networks: A Guide to Network Programmability and Automation in the Data Center, Campus, and WAN. 1st edition. Indianapolis, IN: Cisco Press, 2016.
White, James E. High-Level Framework for Network-Based Resource Sharing. Request for Comments 707. RFC Editor, 1975. doi:10.17487/RFC0707.
“XML Tutorial.” Accessed September 3, 2017. https://www.w3schools.com/xml/.
1. What is the primary objective for automating network device configurations?
2. What are the advantages and disadvantages of SNMP?
3. Explain the relationship between NETCONF, YANG, and a YANG model.
4. What is the difference between NETCONF and RESTCONF?
5. Research the concept of an ATOMIC operation. How is this similar to, or different from, a RESTful interface?
6. What is the difference between development operations, or the automation of network configurations, and Software-Defined Networks (SDNs)?
1. Schönwälder, Overview of the 2002 IAB Network Management Workshop.
Sue had a problem. The infrastructure team had just stood up several new racks of servers, all of which were going to be running workloads needing firewall and load-balancing services. Sue had firewalls and load balancers, but they were not in a place where she could easily provide access to them from the location of the new racks.
The task was not impossible, of course. She created new virtual networks from each of the new racks, and added them to the tagged VLAN interface of the load balancers and firewalls. From there, she created new subinterfaces, added appropriate route statements, and made it work. Traffic was backhauled from the new racks, back through the core, shipped to the load-balanced segment, passed through the firewall, and back. The job was done.
However, the load-balancer configuration was beyond unwieldy, supporting thousands of virtual servers, pool members, and health checks in a chaotic mess that overwhelmed the management interface. The cluster was also a bottleneck at certain times of the day, processor-bound on the one hand and experiencing congested network links on the other. Although the cluster already contained a substantial number of nodes to handle the volume of traffic and requests, its capabilities were never quite able to stay in front of the demand for new load-balancing services.
The data center firewall cluster was in much the same state. Containing a massive security policy choked by multiple thousands of rules, the firewall cluster was becoming an intractable bottleneck. The policy was an administrative nightmare, filled with rules authored by a myriad of administrators who had come and gone over the years. Updates to the security policy were the bane of Sue’s existence. She could never seem to find quite the right place to install fresh rules. At the same time, she was afraid to simplify the policy by deleting existing rules for fear of breaking a critical business service.
Like the load-balancer cluster, the firewall cluster was also becoming a performance bottleneck. As compliance requirements demanded both stateful and deep packet inspections for much of the company’s traffic, the architecture team directed ever increasing amounts of traffic through the firewall cluster. The cluster could no longer keep up. Some days, she thought she could feel the heat from their processors right in her cubicle, watching sustained 60%, then 70%, then 80% steady utilization during peak business hours in recent months.
Sue explored growing the clusters even further, but this solution would only address the capacity problem, and even then only temporarily. She also needed a way to move both load-balancing and firewall services closer to the new racks rolling in month after month as the customer base steadily grew.
She also wanted to reduce the administrative nightmare these clusters had become. She had to admit scrolling through massive configuration paragraphs while tired and under time pressure was eventually going to result in an outage—probably a big one. Endless complexity was not the sort of challenge humans were designed to deal with effectively. One of these days, she was going to make a mistake—a “resume-generating event,” as the operations team always joked about in the cafeteria. Sue wanted to turn over the rote configuration tasks to an automated system but was struggling to sort out exactly how.
As time went on, Sue researched how to address these challenges, which she compartmentalized as follows:
1. Backhauling traffic to specific network locations was too limiting. Sue wanted to be able to move the services to where they were needed, and not move the traffic to where the services were.
2. Network services needed to be able to scale easily. Adding new cluster members was too hard with too much operational overhead. And besides, the load problem was not a problem 24×7. Only during peak hours was there a need for more capacity. Sue wanted to be able to shrink network services when she did not need them as well as grow them on demand.
3. Provisioning of new network services needed to be done quickly and with limited chance for error. Thus, the administrative domains had to become more manageable. Multiple thousands of rules or virtual servers or routing configuration stanzas needed to be broken up into smaller, easier-to-automate-and-understand chunks.
The solution Sue found was Network Function Virtualization (NFV). Much like the compute world has turned bare-metal machines into virtual ones, the networking world has turned bare-metal routers, switches, firewalls, and load balancers into virtual ones.
Using VNFs, Sue moved network services close to the new racks. Rather than backhauling traffic to services in some central location, Sue stood up virtual network services in the same racks where the workloads were running. With this approach, she gained flexibility.
Sue was also able to gain scalability using the NFV approach. Rather than add new cluster members, Sue would stand up additional virtual network service instances when load required.
Sue found many orchestration systems able to handle the spin-up and spin-down of virtual network functions for her. She was even able to integrate many of these automated tasks into the larger compute stack orchestration scheme. When operations stood up a new workload, the networking services it required would come up right along with it, all handled by the orchestrator.
Sue moved her role from one of endless error-prone provisioning to one of orchestration and automation system operator.
Network functions virtualization (NFV) takes network functions once run on dedicated network hardware and repackages them so they can run on generic x86 hardware. “Generic x86 hardware” means a general-purpose hardware platform running an x86 instruction set. Servers and PCs running Linux and Windows operating systems and hypervisors such as Xen Server or VMware ESXi fall into this category.
A network function that has been virtualized in this manner is called, cleverly enough, a Virtualized Network Function (VNF). You heard this right: NFV is made up of VNFs. The critical word to focus on is virtual. Sue can find the answers to many of her architectural challenges through virtualization.
First, consider one of Sue’s initial challenges: backhauling traffic. In Sue’s scenario, she needed to move traffic between the hosts and a service’s clusters. Consider Figure 27-1.
In this example, traffic from several hosts requiring load-balancing services is funneled from hosts uplinked to top-of-rack switches (ToRs), through core switches, and eventually to a load-balancing cluster. These appliance clusters can be scaled up by increasing the amount of processing power in order to handle larger numbers of transactions. This creates a natural bottleneck: host traffic must funnel from a collection of ToRs, forming a wide mouth, down to a comparatively narrow-mouthed location where the load-balancing services are offered in a physical form factor by several clustered hosts. The wider the mouth becomes, the more acutely the bottleneck is felt.
Contrast this traffic pattern with the flow of traffic in Figure 27-2.
In this scenario, a series of virtual load balancers have been created that can scale horizontally, or scale out. Rather than traffic being forced through a funneled bottleneck, a wide load-balancing mouth matching the host traffic is created. The funnel is eliminated.
A single type of traffic flowing into and out of a single type of service is somewhat simple to solve, but in real-world data centers, traffic flows will more likely need to flow through several different VNFs. For instance, a traffic flow might be part of a load-balancing scheme, but might also require packet filtering and deep packet inspection. Traffic must be routed to each of these services in the correct order. While some flows may need to be routed through every service available in the network for processing, others may need to flow through just a subset of the available services. This is a much more difficult problem to solve.
In a traditional network model, traffic would flow through required services because the network was plumbed to make it happen; physical appliances are physically wired into the network so traffic can only be routed through the correct set of services. For instance, in between a client and a server, a firewall would be installed. An inline load balancer, too, would be placed in the path. Traffic would naturally flow through the required services because of the routing architecture created by a network engineer, as illustrated in Figure 27-3.
What happens if some particular traffic flow between the client and the server needs a slightly different set of rules applied in the firewall, or does not need to be managed by the load balancer? Each appliance along the path must be configured with some way to detect which flows it must manage, and with specific instructions on how to manage each one. Over time, in a network with hundreds of thousands of flows, the amount of configuration—and the amount of work required to manage those configurations—grows until it becomes unmanageable. Of course, the configuration process can be automated, but this does not remove the complexity involved in the configurations; it merely moves the complexity someplace else in the network. Instead of managing the complex configurations themselves, operators are managing configuration management systems—and these systems, in turn, tend to increase in complexity over time as new requirements are overlaid onto the network.
NFV not only allows network operators to eliminate the physical appliance bottleneck; it also allows individual virtual appliances to be created for each type of traffic flow in the network. Each virtual appliance can have a much simpler configuration, because it can be inserted into the path of a small subset of the flows passing through the network.
But these two possibilities—virtualizing functions to avoid the topological (or physical) bottlenecks imposed by installing physical appliances in a network to provide services, and virtualizing functions to narrow the focus of any particular function instance to reduce complexity—require a new way of thinking about how to direct traffic through the network. In a VNF scenario, traffic does not naturally flow through the necessary services when passing between client and server. Since the services required have been virtualized, they no longer sit on the wire with physical plumbing and a routing architecture conveniently herding traffic through them. Rather, VNFs are virtual, residing outside the physical path the traffic would otherwise take.
One way to address this concern is through service chaining. Service chaining steers traffic between network functions before allowing it to take its natural path to its destination. Consider Figure 27-4.
Two different paths are represented through the virtualized functions in Figure 27-4:
• Traffic from client 1 is passed through a Stateful Packet Filter (SPF) service, then to a Network Address Translation (NAT) service, then to a load balancer, and then finally passed out to the network toward its final destination.
• Traffic from client 2 is passed through an SPF service, then through a Deep Packet Inspection (DPI) service, and then through the network toward its final destination.
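These two chains can be represented as simple per-flow policy data; the lookup function and keys are hypothetical, standing in for a real classifier matching on a flow's 5-tuple:

```python
# The two chains from Figure 27-4 expressed as data: each flow (reduced
# here to a client name for brevity) maps to an ordered list of services.
# Service abbreviations follow the figure; the lookup is hypothetical.

SERVICE_CHAINS = {
    "client1": ["SPF", "NAT", "LB"],     # filter, translate, load-balance
    "client2": ["SPF", "DPI"],           # filter, inspect; no NAT or LB
}

def chain_for(flow):
    # A flow matching no policy takes its natural path (empty chain).
    return SERVICE_CHAINS.get(flow, [])

print(chain_for("client1"), chain_for("client2"), chain_for("client3"))
```

Ordering matters: the list is traversed front to back, so each flow visits only the services it needs, in the sequence policy dictates.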
Each flow can now pass through just the set of services required, based on specific flow-based requirements. Bypassed services do not need to be configured to ignore flows that they do not need to touch, nor to switch packets related to ignored flows, which both simplifies configuration and reduces unnecessary load on the service.
But how can traffic be chained through services in this way? Service chaining is a nascent technology in networking. At the time of this writing, several industry standards bodies are actively working to standardize the approach.
Note
Two organizations, the Internet Engineering Task Force (IETF) and the European Telecommunications Standards Institute (ETSI), are active in building standards for network function virtualization. Documents in these areas can be found at https://datatracker.ietf.org/wg/sfc/documents/ and http://www.etsi.org/technologies-clusters/technologies/nfv; readers are also referred to the “Further Reading” section at the end of the chapter for specific documents useful for developing a deeper understanding of the technologies and architectures involved in NFV.
Historically, manually installed policy-based routing (PBR) has accomplished service chaining by making a forwarding decision based on the characteristics of a specific traffic flow. For example, traffic from a host with a specific Internet Protocol (IP) address might be routed to the interface of the firewall. PBR has been used by network engineers for exception routing. When traffic needs to go some way other than the standard way indicated by the Forwarding Information Base (FIB), a routing policy is installed to override the FIB.
Therefore, in a limited sense, PBR might function as a service chaining tool, but it is better suited to legacy network topologies featuring physical appliances than to VNF scenarios. PBR is notoriously difficult to manage and is only locally significant. For a PBR scheme to be effective, a PBR policy must be installed at every hop where traffic steering might need to occur. Otherwise, traffic will cease to be chained and will end up being forwarded in accordance with the FIB.
In addition, PBR introduces the same sort of inflexibility that physical appliances do. The entire traffic steering system becomes dependent on predictability. The physical appliances must be in a predictable place. The network architecture must be predictable. But in NFV, the primary goal is for flexibility—a dynamically changing network design and VNFs that come and go as the situation demands.
One way to achieve this flexibility is through Service Function Chaining (SFC), which is evolving as a standard way to route traffic flows through an architecture of VNFs. In SFC, a flow is assigned a Network Service Header (NSH), which contains a service path identifier defining both the services and the order in which the traffic flow is to traverse them. A companion service index assists with path validation and loop prevention. Moving the traffic flow along the chain from service to service requires Ethernet or IP source and destination addresses, the same as it ever has. The service plane created by NSH maps the service path identifier and service index to an overlay; the flow’s packets are encapsulated to route them across each link of the service chain.
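The NSH bookkeeping described above (a path identifier selecting the chain, and an index decremented at each service for path validation and loop prevention) can be sketched as follows. This is simplified Python, not the on-the-wire NSH encoding:

```python
# Simplified model of the two NSH fields discussed above. This is NOT the
# actual header format, only an illustration of how the service index
# bounds the number of hops and so prevents forwarding loops.

class NSH:
    def __init__(self, service_path_id, service_index):
        self.service_path_id = service_path_id   # which chain this flow follows
        self.service_index = service_index       # hops remaining on the chain

    def next_hop(self):
        if self.service_index == 0:
            # A service seeing index 0 drops the packet rather than loop it.
            raise ValueError("service index exhausted: possible loop")
        self.service_index -= 1                  # each service decrements the index
        return self.service_index

hdr = NSH(service_path_id=42, service_index=3)
while hdr.service_index:
    hdr.next_hop()
print(hdr.service_index)   # 0: chain complete, the header can be stripped
```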
NSH might map to a number of encapsulations, including VXLAN, GRE, and plain old Ethernet. The service plane NSH creates means service chaining is topology independent, a crucial feature for services deployed as VNFs.
As with many other things in networking technologies, there are a number of ways to move traffic along a service chain in a network. The chapter describes using an NSH, which is a separate header included in a packet, but there are other ways to direct traffic along a specific path in a network, as well. For instance:
• By building a label-switched path through the network, where each network device reads the outer label, swapping labels to direct each packet to the required hosts connected to the network; this is similar to MPLS Traffic Engineering (TE).
• By building a stack of labels, with each label in the stack representing a hop in the network or a service (virtual device) the packet needs to visit to complete its service chain; this can be done using Segment Routing (SR), for instance.
Each of these solutions has various positive and negative attributes, but the solution deployed in any particular network will mostly depend on hardware support, who owns the applications (how easily the applications can be modified to support service chaining natively), and whether there are requirements for an overlay to solve other problems in the network.
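The label-stack alternative above can be sketched as a simple pop-and-forward loop, similar in spirit to Segment Routing; the label values and packet structure are invented for illustration:

```python
# Toy pop-and-forward loop for a label stack, in the spirit of Segment
# Routing. Labels name services (virtual devices) to visit; when the
# stack is empty, an ordinary lookup delivers to the final destination.

def forward(packet):
    stack = packet["labels"]
    if not stack:
        return packet["dest"]        # stack empty: native lookup to destination
    return stack.pop(0)              # top label names the next service hop

pkt = {"labels": ["DPI", "SPF", "LB"], "dest": "server1"}
path = [forward(pkt) for _ in range(4)]
print(path)   # ['DPI', 'SPF', 'LB', 'server1']
```

Note the contrast with PBR: the sender imposes the whole path once, so intermediate hops need no per-flow policy of their own.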
Figure 27-5 is used to illustrate a service chain through a network.
In Figure 27-5:
1. A policy is set through an automated process, manual configuration, orchestration system, etc., that specifies a particular flow originating at H1 needs to pass through DPI and SPF services before being sent to a particular service for processing. There are several instances of the destination service, so the flow must also pass through a load balancer. This policy is injected into the network either at the originating process or host (if either has the ability to impose service chain headers of some type onto transmitted packets), or configured as a filter with an imposed service chain header at the first hop router—in this case, a ToR device on a data center fabric.
2. The traffic is forwarded to the first service indicated on the service chain; if an IPv6 NSH is being used, the network devices will forward the packet based on the first, or “top,” service in the service chain, rather than based on the destination IP address. If some form of label swapping or stacking is being used, the packet will be forwarded based on the outermost label in the stack. When the traffic reaches the virtual DPI service, the contents of the packet are inspected for malware, etc.
3. The first segment in the service chain is removed from the packet header and the packet transmitted back onto the data center fabric toward the second service. Again, the network devices need to forward the packets in this flow based on the “top” service on the service chain, rather than the destination IP address.
4. When the packet arrives at the second virtual service, it is matched against local state in the stateful packet filter to ensure H1 is allowed to access the destination service, there is an existing flow, etc. The top service is again stripped off the service chain (or labels removed/swapped as needed), and the packet is forwarded back onto the data center fabric.
5. When the packet arrives at the virtual load balancer, the load balancer will check to see if it is part of an existing flow, and modify the label, NSH header, or other information to ensure the packet is forwarded to the correct destination server out of the group of servers providing the destination service. At this point, the IPv6 NSH and/or flow labels may be removed, and the packet forwarded using native IPv6 lookups through the data center fabric to the final destination server. The packet is then forwarded one more time onto the data center fabric for delivery to its final destination.
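The five steps above can be sketched as a toy model in which each hop forwards on the top entry of a service chain header rather than on the destination IP, stripping each segment as its service completes. The service names and header layout here are illustrative only, not the actual NSH wire format.

```python
# Toy model of service-chain forwarding: each hop delivers the packet to the
# "top" service on the chain, strips that segment, and repeats until only
# native destination-based forwarding remains. The service names and header
# layout are illustrative assumptions, not the real NSH format.

def build_packet(payload, chain):
    """Impose a service-chain header at the first-hop ToR (step 1)."""
    return {"chain": list(chain), "dst": "service-pool", "payload": payload}

def forward(packet, services):
    """Walk the chain (steps 2-5): forward on the top service, not dst IP."""
    visited = []
    while packet["chain"]:
        top = packet["chain"][0]        # forward based on top service
        services[top](packet)           # the service processes the packet
        packet["chain"].pop(0)          # strip the consumed segment
        visited.append(top)
    return visited                      # final hop uses a native IP lookup

services = {
    "dpi": lambda p: p.setdefault("inspected", True),   # malware inspection
    "spf": lambda p: p.setdefault("filtered", True),    # stateful filter check
    "lb":  lambda p: p.update(dst="server-2"),          # pick a real server
}

pkt = build_packet("hello", ["dpi", "spf", "lb"])
order = forward(pkt, services)
print(order)        # ['dpi', 'spf', 'lb']
print(pkt["dst"])   # 'server-2' -- the load balancer chose the destination
```

Note that only the load balancer rewrites the destination; the DPI and SPF services inspect and pass the packet along, exactly as in the walkthrough above.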
This process may appear to be complex, but it is much less complex than wiring the entire network so that the traffic in this flow would pass through three separate appliances, and managing the configurations on each device on a per-flow basis.
Network design flexibility means organizations can create VNFs when they need them. Rather than force all traffic through a single massive load-balancer instance or gargantuan appliance-based firewall, services can be spread across many smaller instances. With service chaining, there is no longer a dependency on network placement for where those VNFs are created. Traffic can be directed by policy and service chain to wherever the servicing VNF is located.
This scale-out strategy might sound like simply adding members to a physical cluster to increase capacity. However, in a sense, clustering is nothing more than scaling up, and not out. A cluster with increased capacity still functions as a unit rather than discrete units. The result of adding cluster members to a network service function is an increase in processing capacity but does not come with the advantages that true scale-out architecture offers.
For example, all clusters must remain in full contact with either one another or a cluster controller. When this contact is broken, the cluster is said to be partitioned. While partitioned, each partition will function as its own cluster, a condition known as split-brain. When the partition is healed, the cluster must reconcile, sorting out the inevitable differences resulting from the partitioned clusters acting independently.
Clusters are also subject to major system failures, where the entire system might be taken offline due to an operating system fault or coordinated attack.
Truly scaling network functions out using VNFs breaks service functions down into discrete, independently functioning units. While VNFs are highly likely to be managed centrally, they do not operate as a single device. Therefore, they are not subject to the foibles of partitioned clusters or attacks.
When a VNF fails, the blast radius is limited to the traffic flowing through one VNF. Other VNFs performing identical functions are not affected by the failed VNF. A single small instance failing hurts a production computing environment much less than a massive cluster failing.
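A rough way to quantify the difference: with flows spread evenly across independent VNF instances, one instance failing takes out only its share of traffic, while a cluster acting as a single unit takes out everything. The flow counts and instance counts below are illustrative numbers only.

```python
# Illustrative blast-radius comparison: a cluster acting as a single unit
# versus the same capacity spread across many independent VNF instances.
# All numbers are made up for illustration.

def blast_radius(total_flows, instances, failed):
    """Fraction of flows lost when `failed` of `instances` units go down,
    assuming flows are spread evenly and units fail independently."""
    return (total_flows / instances) * failed / total_flows

# Legacy model: one big cluster -- a full-cluster failure hits every flow.
print(blast_radius(10_000, instances=1, failed=1))    # 1.0

# Scale-out model: 20 small VNFs -- one failure hits 5% of flows.
print(blast_radius(10_000, instances=20, failed=1))   # 0.05
```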
By way of example, consider an imaginary payment processing organization called NetBuckPay. NetBuckPay offers several payment gateways to its customers across several different networks they connect to. One of the gateways is an XML service. Another uses JSON. Another uses a proprietary format.
If NetBuckPay was using the legacy network services model, it might use a large firewall cluster for security, an intrusion detection cluster for deep packet inspection, and a load-balancing cluster to spray transactions across pools of gateway servers.
What happens if…
• The load-balancing cluster fails?
• The firewall cluster fails?
• The intrusion detection cluster fails?
• The network path between any of these clusters fails?
• Any of the clusters becomes overwhelmed with traffic?
• Any of the clusters is attacked?
The observant reader will argue a well-designed cluster can tolerate the outage of a member and a well-designed network would tolerate a failure in the network path. The observant and experienced reader will also know systems tend to fail in complex and unexpected ways. Savvy information technology architects are always looking for ways to reduce the potential blast radius of a failed system.
In a worst-case scenario where a redundant system fails in a spectacular and unexpected way, what is the blast radius? NetBuckPay would lose all three payment gateways it offers to its customers until the failure recovery was complete.
If NetBuckPay was using a VNF model, it would be possible to dedicate VNFs for load-balancing, stateful packet filtering, and intrusion detection services to each gateway. Assuming competent design, this would reduce the blast radius of a failure to a single payment gateway. Rather than all payment gateway customers being offline due to a commonly shared resource failure, select customers would be impacted as the result of a contained, discrete failure.
Reflecting on Sue’s challenges, one of them was difficulty in provisioning. As network services become increasingly utilized, their configurations become increasingly complex. Network services perform a central function, but the way in which the function is performed along with unique handling for specific situations results in lengthy command-line stanzas describing how the device is to behave. For devices driven by a graphical user interface (GUI), pages upon pages of screens with configuration information populate the interface.
For network operators, configuration management is a critical part of their roles in an IT organization. Managing configurations effectively and accurately is an essential part of bringing an application to life.
But as Sue recounted, configuration management is also the most error-fraught. With large configurations come opportunities for a human to get lost in the configuration. The more configuration objects there are to collide with, the more difficult it is to make additions that do not disturb the existing configuration functionality. Conversely, deleting what seems to be stale configuration data is risky, as proving what configuration elements are or are not in use is challenging.
VNFs help networkers with the configuration problem. Assuming manual configuration is being done, a single VNF dedicated to a specific purpose will have a much smaller set of configuration data that a human being must work through. This reduces the opportunity for error as well as the time required to simply sort out what the appropriate configuration might be.
However, VNFs are often managed in an automated way, where a central policy manager stands up and tears down services. The human interacts with the central policy manager. The policy manager handles the VNFs and their configuration.
As VNFs have grown in popularity, the techniques used to manage their policies have evolved. One example to consider is stateful packet filter policy management. Traditionally, stateful packet filters have been managed by rules permitting or denying traffic flows at a very detailed level, potentially including the source IP address, the destination IP address, the source port number, the destination port number, whether there is an existing session, and even various Transmission Control Protocol (TCP) flags. In other words, granular flow information is used to describe each rule.
Applications often use several different ports to communicate. Hosts might use several different IP addresses to communicate. Therefore, building a stateful packet inspection policy out of granular rules is enormously challenging. The challenge grows as the number of applications that need to be permitted through the stateful packet filter grows and as the number of hosts involved in serving the applications grows. Over time, traditional stateful packet filtering policy management fails.
Traditional stateful packet filter policy management is not a realistic option when considering many small packet filters deployed as VNFs. To handle VNF packet filter management, central policy management is used. A single policy leveraging metadata is written.
In this context, metadata refers to less granular ways to group objects. For example, users might be grouped by an object in Microsoft’s Active Directory. Applications might be grouped by name. Hosts might be grouped by DNS suffix. Leveraging metadata, humans can write policies that say, “Hosts containing ‘web’ as part of their name can perform SQL queries against hosts containing ‘dbase’ as part of their name.”
The central policy manager software analyzes the policy, the metadata, and the hosts filtered by VNF packet filters. The policy manager then compiles and deploys the correct packet filter rules for each VNF stateful packet filter. Figure 27-6 illustrates.
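The compilation step can be sketched as follows: a single metadata-based policy (“web” hosts may run SQL against “dbase” hosts) is expanded into the granular per-host rules each VNF packet filter actually enforces. The host names, IP addresses, SQL port (3306), and rule format are all assumptions for illustration, not the output of any particular product.

```python
# Sketch of a central policy manager compiling one metadata-based policy
# into granular permit rules for VNF stateful packet filters. Host names,
# addresses, the port number, and the rule format are illustrative.

def compile_policy(policy, inventory):
    """Expand a name-pattern policy into per-host permit rules."""
    sources = [h for h in inventory if policy["src_tag"] in h["name"]]
    dests   = [h for h in inventory if policy["dst_tag"] in h["name"]]
    return [
        {"permit": True, "src": s["ip"], "dst": d["ip"], "dport": policy["port"]}
        for s in sources
        for d in dests
    ]

inventory = [
    {"name": "web-01",   "ip": "10.0.1.10"},
    {"name": "web-02",   "ip": "10.0.1.11"},
    {"name": "dbase-01", "ip": "10.0.2.20"},
]

# "Hosts containing 'web' can perform SQL queries against 'dbase' hosts."
policy = {"src_tag": "web", "dst_tag": "dbase", "port": 3306}

rules = compile_policy(policy, inventory)
print(len(rules))   # 2 -- one granular rule per web host toward the dbase host
```

The human writes one line of policy; the software derives however many granular rules the inventory requires, and re-derives them as hosts come and go.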
This approach removes the burden of granular management from the network operator, shifting it to software. Policy management software makes managing dynamic VNFs possible.
An emerging technique being used to handle complex configuration is intent-based networking. Intent-based networking allows for plain language to be used to express a configuration desire. While traditional configuration describes exactly how the network is to behave, intent-based networking describes the hoped-for outcome, but not how the outcome will be achieved.
Intent-based networking is interesting in the context of VNFs because it introduces a layer of abstraction between network state and configuration specifics. The intent engine interprets the generic intent expressed by a human or possibly software, and then sends the specific configuration (or individuated policies) required to convert the intent into network state. Figure 27-7 illustrates.
Intent is also a useful tool in enforcing a desired network state. If intent is used to describe the desired network state, and the intent engine can interpret those directives into configuration specifics, then the intent engine can also be leveraged to determine when the network state no longer matches the expressed intent.
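This enforcement idea can be sketched as a reconciliation loop: the intent engine renders intent into a desired state, reads the actual state, and reports any drift. The intent vocabulary and state keys below are hypothetical, chosen only to make the mechanism concrete.

```python
# Sketch of intent-based drift detection: render intent into desired state,
# compare against observed state, and report what no longer matches.
# The intent fields and state keys are hypothetical.

def render_intent(intent):
    """Translate a generic intent into concrete desired state per VNF."""
    return {
        vnf: {"members": intent["pool_size"], "health_check": "on"}
        for vnf in intent["load_balancers"]
    }

def find_drift(desired, observed):
    """Return the VNFs whose observed state diverges from the intent."""
    return {
        vnf: {"want": want, "have": observed.get(vnf)}
        for vnf, want in desired.items()
        if observed.get(vnf) != want
    }

intent = {"load_balancers": ["lb-1", "lb-2"], "pool_size": 4}
desired = render_intent(intent)

observed = {
    "lb-1": {"members": 4, "health_check": "on"},   # matches intent
    "lb-2": {"members": 3, "health_check": "on"},   # a pool member dropped
}

drift = find_drift(desired, observed)
print(list(drift))   # ['lb-2'] -- state no longer matches expressed intent
```

In a real system the same loop would go one step further and push corrective configuration back toward the drifted device.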
Intent is still very new, and proving difficult to implement. Wide variations in network hardware and network operating system software introduce many variables that make implementing intent-based networking a complex programming challenge. Nonetheless, specific intent-based networking implementations have found their way into open source projects such as Open Network Operating System (ONOS) and OpenDaylight. In addition, several commercial variants have been introduced to the market, with others expected to find their way to market soon.
The chief benefit of VNF automation is a decreased time to service. Bringing applications to market quickly is a critical benefit of VNFs, as they allow for a service to be composed and instantiated with a minimum of risk and without human configuration.
Centralized policy management based on metadata and intent-based networking are examples of tooling that enable VNF spin-up time to be reduced.
Physical network devices such as switches and routers run on built-for-purpose silicon. Other network devices such as firewalls and load balancers might run on general-purpose CPUs while offloading specific functions to dedicated hardware. Encryption is an example of this, where a load balancer might run most functions on a general-purpose CPU, while mathematically intensive packet encryption is offloaded to dedicated silicon to maintain high throughput levels.
Built-for-purpose silicon chips are called Application-Specific Integrated Circuits (ASICs). ASICs are designed by networking vendors to perform a small number of networking functions and do them quickly. ASICs are limited in function—they are “application specific.” Thus, network devices perform tasks using ASICs to do what they do very well but cannot perform anything beyond what the ASIC was designed for.
In contrast to ASICs, general-purpose processors are found in servers and desktop computers. General-purpose processors are designed to run a wide variety of software. Intel x86 processors are the most widely known general-purpose processor; network functions that have been virtualized are said to be running on x86.
While ASICs do a small number of things extremely well, x86 processors do a large number of things merely adequately. This is critical to understanding VNFs. When a network function is virtualized to run on x86, performance might be reduced when compared to the same function running on an ASIC.
Discussions about VNFs and performance are commonplace due to two major concerns.
1. VNFs need to function quickly enough to fill the network pipe of the host they run on. Ethernet speeds of 10, 25, 40 Gbps, and even higher are all interesting to network operators leveraging VNFs.
2. VNFs use the general-purpose x86 CPU in hosts to provide their services. These hosts are also going to be running other workloads the data center requires—web servers, database engines, and so on. CPU cycles used for VNFs are not available for those other workloads.
There are two significant means VNF software architects use to gain sufficient throughput from their network services. The first means is software optimization. The second is hardware offload.
Software optimization is related to the peculiarities of the operating system that many virtualized network functions run in, Linux. Linux processes network functions in the Linux kernel. However, the VNF software is going to be running in user space. When the user space program needs access to the network interface hardware to send or receive packets, it will perform a system call to the kernel. A hardware interrupt is performed, and data is copied between kernel and user space. All of this takes time, reducing the maximum amount of throughput the VNF might otherwise achieve.
Software optimization of VNFs seeks to eliminate the back-and-forth between kernel space and user space. Many networking software stacks run completely in Linux user space. To gain access to the network hardware, these user space stacks leverage an open source project called Data Plane Development Kit (DPDK). DPDK provides a means for a networking stack to directly access the networking interface inside a host without having to perform system calls to the kernel. This reduces latency, subsequently increasing throughput.
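A back-of-envelope model shows why eliminating the kernel round trip matters: the kernel path pays a per-packet toll of system call, interrupt, and copy, while a DPDK-style user-space path polls the NIC directly. The per-packet costs below are made-up illustrative numbers in microseconds, not measurements of Linux or DPDK.

```python
# Toy per-packet cost model contrasting the kernel path (syscall + interrupt
# + copy per packet) with a DPDK-style user-space path (polling, no kernel
# transition). All costs are made-up numbers in microseconds, not benchmarks.

def kernel_path_us(packets, syscall=1.0, interrupt=0.5, copy=0.5):
    """Each packet pays for a system call, an interrupt, and a data copy."""
    return packets * (syscall + interrupt + copy)

def dpdk_path_us(packets, per_packet=0.25):
    """The user-space stack polls the NIC directly; no kernel round trip."""
    return packets * per_packet

pkts = 1_000_000
print(kernel_path_us(pkts))  # 2000000.0 microseconds of overhead
print(dpdk_path_us(pkts))    # 250000.0 -- 8x less, so higher throughput
```

Whatever the real numbers are on a given platform, the structure of the win is the same: a fixed per-packet toll removed from the data path compounds into a large throughput gain at millions of packets per second.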
Hardware offload consists of network interface cards (NICs) with customized silicon designed to offload the x86 CPU from some VNF tasks. NICs with customized silicon are costly compared to less-capable NICs, as well as being specialized for specific environments. These custom NICs run with special drivers, handing off functions from specific VNFs to hardware to accelerate them.
If you have not found the tradeoffs, you have not looked hard enough.
This is true in every area of engineering (and life!), including NFV. This chapter has largely considered the case for NFV. What are the tradeoffs? The State/Optimization/Surface (SOS) triad will be useful in evaluating these tradeoffs.
Rather than putting an appliance, or a cluster of appliances, into the path of packets flowing through the network, NFV brings the packets to the services. These services, in turn, could be scattered throughout the network, including being located anyplace on a data center fabric. If you consider the movement of traffic through the network as an optimization problem, NFV requires more granular information about where services are located. In essence, the service becomes the destination, rather than a logical subnet. NFV, then, will require the control plane to carry greater amounts of state.
At the same time, virtualized services are likely to move more often than services in a physical cluster or appliance; it is more difficult to unbolt, unrack, rack, and bolt an appliance than it is to respin a service on a new server. This means services move around more quickly both “because they can” (lowered cost often leads to less discipline) and because, once the network is perceived as “free,” the service “wants” to move to the highest-quality, lowest-cost compute resources possible.
There is another aspect of state to consider, as well: the distribution of policy in the network. It is simple enough to say, “smaller chunks of state spread throughout the network are easier to manage than one large configuration or state store.” Each individual piece of state is smaller and more tuned to the local need (a return to the principle of subsidiarity). On the other hand, understanding how widely distributed state interacts can be a lot more difficult; the interaction between two pieces of state in a single configuration file can be difficult to understand, and the interaction between two pieces of state configured on two different devices, widely separated in the network, and with different configuration parameters and/or styles can easily move into the “impossible” territory.
All of the additional state used to configure and manage a larger number of devices must also be carried on the network, which means a different order of magnitude of state must be carried and managed on the network. This not only eats network resources, but it also increases resilience demands on the network.
There are several tradeoffs to consider in the realm of optimization. First, NFV often treats the network as a “free resource.” Any time you make a resource appear to be cheap or free, you are saying you would prefer to use more of the free resource and less of more expensive resources. If the cost of the network is free, the marginal utility on network resources drops to the point where network resources are not considered when deciding where to run a particular service and why. Spreading services out across the network drives more traffic onto the network, which uses more network resources than gathering services into clusters would.
Network utilization is not just about the amount of bandwidth being carried over the network versus the amount of work being done. There is also the efficiency of troubleshooting, which directly impacts the Mean Time to Repair (MTTR), and therefore the network uptime (or measured resilience).
Finally, there are interaction surfaces to consider. It often sounds good to automate everything and then turn the automation system over to an intent-based controller to manage the interaction between the applications running on the network, the control plane, and device configuration. Each of these new interactions, however, also represents a new interaction surface, with the implied complexity of abstraction, leaky abstractions, and other issues with interaction surfaces. Each of these interaction surfaces will require an Application Programming Interface (API), which introduces the complexity of managing these APIs over time.
There are other tradeoffs to consider, as well, such as whether outsourcing as much complexity as possible to a vendor in the process of moving to an intent-based network is really a “good thing.” Internal skill sets are bound to atrophy when complex problems are outsourced, leaving the business without any local resources to call on when problems happen. Moving configuration, policy, and intent to a single device can mean a single mistake impacts a lot more devices. You are trading the centralized configuration of a clustered appliance for the centralized configuration of an orchestration system. Does this make sense? As with all things, it is important to consider the tradeoffs.
NFV and intent-based networking are two attempts to define a simpler network for the future. The question, as always, is: “Simpler in what way—and more complex in what other ways?” NFV, combined with service chaining, the disaggregation of network services out of appliances and into standardized compute resources, the concept of scale-out services, and the movement toward automation and intent are all interesting trends in the network engineering world that will ultimately make contributions to the way networks are designed and operated.
The next chapter will discuss another trend likely to make a large impact on the way networks are designed and operated: the Internet of Things.
Boucadair, Mohamed. “Service Function Chaining (SFC) Control Plane Components & Requirements.” Internet-Draft. Internet Engineering Task Force, October 2016. https://datatracker.ietf.org/doc/html/draft-ietf-sfc-control-plane-08.
Dolson, David, Shunsuke Homma, Diego Lopez, Mohamed Boucadair, Dapeng Liu, Ting Ao, and Vu Anh Vu. “Hierarchical Service Function Chaining (hSFC).” Internet-Draft. Internet Engineering Task Force, January 2017. https://datatracker.ietf.org/doc/html/draft-ietf-sfc-hierarchical-02.
Filsfils, Clarence, Stefano Previdi, Bruno Decraene, Stephane Litkowski, and Rob Shakir. “Segment Routing Architecture.” Internet-Draft. Internet Engineering Task Force, February 2017. https://datatracker.ietf.org/doc/html/draft-ietf-spring-segment-routing-11.
Halpern, Joel M., and Carlos Pignataro. Service Function Chaining (SFC) Architecture. Request for Comments 7665. RFC Editor, 2015. https://rfc-editor.org/rfc/rfc7665.txt.
Hudson, Jon, Lawrence Kreeger, Dr. Thomas Narten, Marc Lasserre, and David L. Black. An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3). Request for Comments 8014. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc8014.txt.
K., Thomas. “User Space Networking Fuels NFV Performance.” Intel Developer Zone, June 12, 2015. https://software.intel.com/en-us/blogs/2015/06/12/user-space-networking-fuels-nfv-performance.
Kumar, Surendra, Mudassir Tufail, Sumandra Majee, Claudiu Captari, and Shunsuke Homma. “Service Function Chaining Use Cases in Data Centers.” Internet-Draft. Internet Engineering Task Force, February 2017. https://datatracker.ietf.org/doc/html/draft-ietf-sfc-dc-use-cases-06.
“L4-L7 Service Function Chaining Solution Architecture.” Open Networking Foundation, June 14, 2015. https://www.opennetworking.org/images/stories/downloads/sdn-resources/onf-specifications/L4-L7_Service_Function_Chaining_Solution_Architecture.pdf.
Lasserre, Marc, Florin Balus, Thomas Morin, Dr. Nabil N. Bitar, and Yakov Rekhter. Framework for Data Center (DC) Network Virtualization. Request for Comments 7365. RFC Editor, 2014. https://rfc-editor.org/rfc/rfc7365.txt.
Nadeau, Thomas, and Paul Quinn. Problem Statement for Service Function Chaining. Request for Comments 7498. RFC Editor, 2015. https://rfc-editor.org/rfc/rfc7498.txt.
Narten, Dr. Thomas, Luyuan Fang, Eric Gray, Lawrence Kreeger, Maria Napierala, and David L. Black. Problem Statement: Overlays for Network Virtualization. Request for Comments 7364. RFC Editor, 2014. https://rfc-editor.org/rfc/rfc7364.txt.
“Network Functions Virtualisation (NFV); Continuous Development and Integration; Report on Use Cases and Recommendations for VNF Snapshot.” European Telecommunications Standards Institute, March 2017. http://www.etsi.org/deliver/etsi_gr/NFV-TST/001_099/005/03.01.01_60/gr_NFVTST005v030101p.pdf.
“Network Functions Virtualisation (NFV) Release 3; NFV Evolution and Ecosystem; Hardware Interoperability Requirements Specification.” European Telecommunications Standards Institute, March 2017. http://www.etsi.org/deliver/etsi_gs/NFV-EVE/001_099/007/03.01.02_60/gs_NFV-EVE007v030102p.pdf.
“Network Functions Virtualisation (NFV) Release 3; Security; Security Management and Monitoring Specification.” European Telecommunications Standards Institute, February 2017. http://www.etsi.org/deliver/etsi_gs/NFVSEC/001_099/013/03.01.01_60/gs_NFV-SEC013v030101p.pdf.
“Network Functions Virtualisation (NFV); Reliability; Report on Models and Features for End-to-End Reliability.” European Telecommunications Standards Institute, April 2016. http://www.etsi.org/deliver/etsi_gs/NFVREL/001_099/003/01.01.01_60/gs_NFV-REL003v010101p.pdf.
奎因、保罗和乌里·埃尔祖尔。“网络服务标头。” 互联网草案。互联网工程任务组,2017 年 2 月。https: //datatracker.ietf.org/doc/html/draft-ietf-sfc-nsh-12。
Quinn, Paul, and Uri Elzur. “Network Service Header.” Internet-Draft. Internet Engineering Task Force, February 2017. https://datatracker.ietf.org/doc/html/draft-ietf-sfc-nsh-12.
奎因、保罗和吉姆·吉查德。“服务功能链:使用网络服务标头 (NSH) 创建服务平面。” IEEE。访问日期:2017 年 4 月 21 日。https://www.opennetworking.org/images/stories/downloads/sdn-resources/IEEE-papers/service-function-chaining.pdf。
Quinn, Paul, and Jim Guichard. “Service Function Chaining: Creating a Service Plane Using Network Service Header (NSH).” IEEE. Accessed April 21, 2017. https://www.opennetworking.org/images/stories/downloads/sdn-resources/IEEE-papers/service-function-chaining.pdf.
李一舟、Lucy Yong、Lawrence Kreeger、Thomas Narten 博士和 David L. Black。“拆分 NVE 控制平面要求。” 互联网草案。互联网工程任务组,2017 年 2 月。https: //datatracker.ietf.org/doc/html/draft-ietf-nvo3-hpvr2nve-cp-req-06。
Yizhou, Li, Lucy Yong, Lawrence Kreeger, Dr. Thomas Narten, and David L. Black. “Split-NVE Control Plane Requirements.” Internet-Draft. Internet Engineering Task Force, February 2017. https://datatracker.ietf.org/doc/html/draft-ietf-nvo3-hpvr2nve-cp-req-06.
Yong、露西、奥尔德林·艾萨克、琳达·邓巴、穆罕默德·托伊和维什瓦斯·曼拉尔。“数据中心网络虚拟化覆盖网络的用例。” 互联网草案。互联网工程任务组,2017 年 2 月。https ://datatracker.ietf.org/doc/html/draft-ietf-nvo3-use-case-17。
Yong, Lucy, Aldrin Isaac, Linda Dunbar, Mehmet Toy, and Vishwas Manral. “Use Cases for Data Center Network Virtualization Overlay Networks.” Internet-Draft. Internet Engineering Task Force, February 2017. https://datatracker.ietf.org/doc/html/draft-ietf-nvo3-use-case-17.
1. In what ways are humans ill-suited to performing configuration management?
2. Explain the constraints placed upon a network infrastructure by physical network functions.
3. Why is service function chaining necessary?
4. Explain the purpose of a network service header.
5. How do VNFs help reduce the “blast radius” of a network outage?
6. What is the purpose of metadata in centralized policy management?
7. How is intent-based networking distinct from traditional configuration?
8. What is the chief benefit of VNF automation?
9. What is the most significant impact of moving a network function from an ASIC to a general-purpose CPU?
10. In what way does the Linux kernel impose a bottleneck to VNF performance?
While the term cloud has come to mean many different things, it will be defined here as virtualized products or infrastructure consumed via self-service. In the cloud model, the consumer uses a portal or Application Programming Interface (API) to requisition a service, platform, or server, specifying requirements during the request. The request is fulfilled by software automatically, so the consumer can use the requested service immediately. Common examples of cloud services include
• Software as a Service (SaaS). SaaS is application software for which the consumer neither provides hardware nor buys and installs shrink-wrapped software. Rather, the service is leased via a subscription and consumed over the Internet through a web browser, mail client, or other custom software supplied by the SaaS provider. SalesForce.com, Microsoft’s Office365, and Google Apps for Business are all examples of SaaS.
• Platform as a Service (PaaS). PaaS offerings are software building blocks used in the creation of a full software package. PaaS offers software developers a blank canvas for their own projects, as opposed to the complete software products offered as SaaS and aimed at end users. PaaS building blocks vary by vendor but include features such as programming languages, development testing environments, databases, security, load balancing, workload orchestration, and data analytics.
• Serverless or Functions as a Service (FaaS). FaaS are on-demand software routines hosted in a cloud environment that, upon receiving a data input, perform processing and return an output. In fact, FaaS runs on servers just like any other computing function but is not bogged down with maintaining a heavy software environment. FaaS was first popularized using the term serverless, and serverless is still seen in most technical literature describing the service.
• Infrastructure as a Service (IaaS). IaaS is virtualized servers, storage, networking, and security. IaaS consumers request a virtual machine with a specific number of virtual CPUs, RAM, storage, etc. The consumer can also request an operating system to be installed, or even a service such as switching, routing, load balancing, etc. From there, the IaaS compute is available as a virtual machine to run any software the consumer can install on it. IaaS offers a consumer maximum flexibility of software without running physical hardware. IaaS providers could be thought of as virtual data centers.
The lines separating SaaS, PaaS, serverless, and IaaS are not well defined; some cloud products could conceivably be lumped into more than one “as-a-service” category, and the market changes rapidly. Because of this, it is not useful to become overly fixated on the distinctions, especially for this chapter, which will primarily focus on the technical problems faced by network engineers when working with cloud-based services.
Computing clouds are most often thought of as public clouds. The physical infrastructure making up the cloud is not owned by the organization consuming it. Rather, public clouds are created and maintained by third parties. The most well-known public clouds are Amazon Web Services (AWS), Microsoft Azure (Azure), and Google Cloud Platform (GCP).
However, there is no technical reason an organization cannot build a computing cloud system hosted on its own physical infrastructure. A cloud of this type is called a private cloud. Private clouds are built by organizations that want to move beyond traditional computing infrastructure but are constrained by cost, security, or privacy concerns that rule out public cloud adoption. OpenStack, CloudStack, and other orchestration engines can be used to create private clouds.
Most cloud-consuming organizations are neither exclusively public nor private cloud users. Instead, they are hybrid cloud users and operators. Some of their compute workloads are in the private cloud, while others are in the public cloud. Typically, there is some sort of physical and virtual network interconnecting the different domains making up the hybrid cloud environment.
Most hybrid cloud organizations consuming public cloud also use more than one public cloud. For example, an organization might use both AWS and Azure. These organizations are said to be multicloud.
One critical consideration for network engineers studying cloud computing is Application Programming Interfaces (APIs). The configuration of network equipment traditionally involves using a command-line interface (CLI) to create configurations describing how a device is supposed to behave. Commands are entered via the CLI, and responses to those commands are posted by the network operating system. The CLI is intended primarily as a human-friendly interface. The output in particular is geared toward a human being reading the information on a screen and making sense of it.
In contrast with the CLI, APIs are intended for programs. APIs accept specific input and provide structured output. A program using APIs to configure a device will provide a specific input, perhaps about an interface or routing protocol, and make a call to an appropriate API class and method using the input. The API will accept the input and act on it appropriately, configuring the device. The result of the operation is returned to the calling program as structured data. The structured data conforms to a predictable format and can be stored by the program for later reference.
APIs are used for configuration as well as status inquiry. For instance, the appropriate API calls can return structured output describing the state of all Border Gateway Protocol (BGP) neighbors, interface counters, and so on.
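As an illustration of what structured output buys a program, the sketch below filters a JSON response for BGP neighbors that are not in the Established state. The endpoint, field names, and schema here are invented for this example; real device APIs (RESTCONF, gNMI, vendor-specific REST interfaces) each define their own data models.

```python
import json

# A structured API response (JSON), as a device with a REST API might
# return it. The field names are illustrative, not a real schema.
SAMPLE_RESPONSE = json.dumps({
    "bgp_neighbors": [
        {"address": "203.0.113.1", "remote_as": 65001,
         "state": "Established", "prefixes_received": 1250},
        {"address": "203.0.113.5", "remote_as": 65002,
         "state": "Idle", "prefixes_received": 0},
    ]
})

def down_neighbors(raw_json):
    """Return the addresses of BGP neighbors not in the Established state.

    Because the API output is structured, a program can filter it
    directly; no screen-scraping of CLI text is required."""
    data = json.loads(raw_json)
    return [n["address"] for n in data["bgp_neighbors"]
            if n["state"] != "Established"]

if __name__ == "__main__":
    print(down_neighbors(SAMPLE_RESPONSE))  # ['203.0.113.5']
```

The same filtering against raw CLI output would require fragile text parsing that breaks whenever the vendor changes screen formatting; the structured response makes the program's logic a one-line filter.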
Cloud resources are often consumed via APIs. For businesses automating the life cycle of their applications, API consumption is a given, as programs are behind the automation process. For network engineers, this means familiarity with programming, APIs, and automation techniques is now expected, and integrating networking services into a larger provisioning task becomes a useful, even critical, skill.
Note
For a deeper understanding of APIs and automation, see Chapter 26, “The Case for Network Automation.”
When a business moves from using internal resources, building either a cloud or a more traditional information technology infrastructure, to consuming cloud services, it is outsourcing infrastructure and operations. Why would a business outsource its infrastructure and operations to another company? The reasons are not much different from many other decisions to outsource; they are financial and operational.
There are two basic kinds of costs in business:
• Capital expenses (CAPEX) are what a company buys in order to operate the services it sells. This includes desks, buildings, and information technology, such as routers and switches.
• Operational expenses (OPEX) are what a company pays for on an as-needed basis, such as people, services, and consumable goods.
There is some amount of tradeoff between these two; sometimes you can buy equipment that will reduce OPEX. Other times, of course, equipment purchases add to OPEX. Purchasing cloud services moves the cost for processing power from being a mix of CAPEX and OPEX to being entirely OPEX; by outsourcing, the business does not need to worry about buying any sort of network or server gear other than to support connectivity within campuses or to the cloud service. Moving from CAPEX to OPEX is helpful to business operations because it results in smoother, more predictable cash flow.
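The cash-flow difference can be sketched with a toy model: a one-time CAPEX purchase plus small recurring OPEX versus a pure-OPEX cloud subscription. All the dollar figures below are invented purely for illustration.

```python
def cumulative_spend(upfront, monthly, months):
    """Cumulative cost: one-time CAPEX plus recurring monthly OPEX."""
    return [upfront + monthly * m for m in range(1, months + 1)]

# Hypothetical figures, for illustration only.
on_prem = cumulative_spend(upfront=120_000, monthly=2_000, months=48)
cloud   = cumulative_spend(upfront=0,       monthly=5_000, months=48)

# Cloud spend is smoother (no up-front spike), but the lines can cross:
crossover = next(m for m, (p, c) in enumerate(zip(on_prem, cloud), 1)
                 if c > p)
print(crossover)  # 41 -- the month cloud becomes the pricier option here
```

The point of the sketch is the shape of the curves, not the numbers: the cloud option trades a large, lumpy capital outlay for a smooth, predictable monthly expense, which is exactly the cash-flow property the text describes.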
Businesses often assume, as well, that large-scale cloud providers, because they have access to bulk buying and deeply staffed design and hardware teams, can build and manage computing resources more cheaply than a smaller company can. Specifically, large-scale network operators can take advantage of white box hardware and open source software to provide services at a lower cost than a company not specializing in providing computing resources. This may, or may not, be true in any particular situation, depending on the maturity of a given ecosystem, the actual requirements placed on the network, and the willingness of the business to look outside the box for solutions.
The amount of OPEX varies with how much of the cloud operator’s resources are consumed, measured in a variety of metrics including CPU cycles and network bits transferred. OPEX might further vary due to changes in staffing or consulting requirements. It is possible that outsourcing infrastructure building to public cloud operators reduces the need for certain kinds of in-house expertise. This will be more true for SaaS offerings than PaaS or IaaS.
Some businesses might find public cloud consumption is more expensive than operating their own private infrastructure. In fact, some organizations have shifted workloads back from public cloud to private infrastructure to lower OPEX costs. Serverless is one response by public cloud operators to this problem. Applications leveraging FaaS instead of sitting on full-blown virtual machines or long-lived containers can see as much as an 80% cost reduction.
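As a rough sketch of why per-invocation billing can produce savings of this magnitude for bursty workloads, compare an always-on virtual machine with a function billed only for execution time actually consumed. The rates, memory size, and invocation counts below are hypothetical, not any provider's actual pricing.

```python
def vm_monthly_cost(hourly_rate, hours=730):
    """An always-on VM is billed for every hour, busy or idle."""
    return hourly_rate * hours

def faas_monthly_cost(invocations, seconds_per_call,
                      rate_per_gb_second, gb=0.5):
    """FaaS bills only for the execution time actually consumed."""
    return invocations * seconds_per_call * gb * rate_per_gb_second

# Hypothetical pricing, for illustration only.
vm = vm_monthly_cost(hourly_rate=0.10)            # roughly $73/month
fn = faas_monthly_cost(invocations=1_000_000,
                       seconds_per_call=0.2,
                       rate_per_gb_second=0.0000166667)

print(round(1 - fn / vm, 2))  # 0.98 -- fraction saved in this scenario
```

The savings evaporate as the workload approaches constant utilization; a function that is always executing costs more than the equivalent always-on instance, which is why FaaS suits bursty rather than steady workloads.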
Public cloud gives businesses computing infrastructure with the swipe of a credit card, reducing procurement times from weeks or months to minutes. This sort of immediate access to infrastructure and information technology can enable a business to bring products and solutions to market very quickly. Another use for this immediate access to information technology is the ability to shift load during a peak in the business cycle, so the business does not lose opportunity because of an inability to scale. The business versus infrastructure spending chart originally used in Chapter 1, “Fundamental Concepts,” provides a good example of where cloud computing can be useful for business agility; Figure 28-1 illustrates.
In order to avoid the times marked in dark gray as lost business opportunity, many businesses will consistently overbuild their infrastructure. Cyclical businesses are in the worse position of trying to cope with consumer behavior while also managing this kind of growth curve. Rather than buying enough network and compute resources to stay consistently above demand, businesses can use public cloud computing platforms as a resource on which to temporarily burst (such as a retail operation in the few weeks before the winter holidays), leveling out their information processing purchases over time.
While there are advantages to using public cloud services, covered in the previous section, there are also tradeoffs—those little things the account team selling the services is either not going to tell you about or is going to minimize as challenges. It is important, however, to really consider these tradeoffs when making the decision to move processing to a public cloud. The benefit of public cloud is neither obvious nor a foregone conclusion, as it varies with the business situation in question. This section will consider some of the various tradeoffs that businesses and engineers need to weigh when considering moving their processing to public cloud services.
A common reason for moving to cloud is to reduce internal operational costs by reducing the number of network and infrastructure engineering resources required to build and deploy applications. Because infrastructure resources can be consumed via APIs in a public cloud environment, there is no longer any need for an operations team. Developers building an application can use their programming skills to build the application and to provision the infrastructure the application runs on. This might appeal to businesses trying to conserve operational expenses. So not only are all the costs of providing computing shifted from OPEX to CAPEX, the total amount of OPEX is at least held steady, if not reduced, as well. If application developers can do the job of infrastructure engineers, this seems to present an attractive cost savings. However, this “NoOps” view of infrastructure is short-sighted for several reasons as described in the sections that follow.
Infrastructure engineers are competent in infrastructure provisioning, to be sure. However, provisioning infrastructure in a way that meets business objectives is complex. Business requirements drive specific infrastructure provisioning decisions.
For example, a business might have specific service level agreements (SLAs) it has made with its customers. To meet those SLAs, the organization will have matching resiliency requirements for their application. The application might need to be available despite a catastrophic failure in a specific region. The application might have to respond to user requests in a specific amount of time.
These sorts of requirements require a keen understanding of infrastructure design. To maintain availability, any number of infrastructure decisions must be made:
• The network engineer might need to apply specific security policies to a traffic flow.
• Network capacity must be sized to meet demand.
• Connections between multiple clouds might be required, depending on what resources an application calls on and where those calls are going to and from.
• Data replication to a disaster recovery site will require connectivity and capacity, as well as routing and possibly address translation services to ensure application availability during a primary site outage.
In short, light does not come from a light switch. While flipping a light switch turns on the lights, being able to operate a light switch implies no knowledge of electrical supply, electrical circuits, circuit breakers, ground wires, or even light bulbs. And yet, all those components are critical to the functioning of the light.
While simple failures can be overcome by throwing out a broken piece of equipment or software, and replacing it with a new one, many infrastructure failures are not simple. Infrastructure, especially infrastructure designed for resiliency, tends to be complex. The more complex something is, the greater the likelihood is that something can go wrong, and the more nuanced the resultant problem might be.
Troubleshooting complex problems requires deep expertise. In the cloud computing era, there is still a need for infrastructure engineers with deep expertise in managing and troubleshooting complex large-scale networks. Infrastructure engineers know how to bring the lights back on when flipping the switch no longer works.
Many businesses just “assume” that once they have moved their processing to the cloud, they will be able to reach the processing resources “over the Internet.” The reality is moving a lot of data can still cost a lot of money. Circuits still need to be purchased and maintained, Quality of Service must be configured, local resources to “land” the data must be configured and maintained, etc. The costs of these circuits will most likely increase through any kind of cloud migration, and should be an area of particular concern in the case of hybrid- or multicloud deployments. Jitter and latency are also components of cost in operations; these are a real concern because the provider’s physical infrastructure may not align with your business operations.
While cloud computing can often provide generic processing resources at a much lower cost than buying, installing, and maintaining local compute resources, the theory of cloud is grounded in treating every resource and every problem in as much the same way as possible. For instance, you might initially assume a single network device, such as a data center fabric router, can be replaced with a single processor in the cloud. In this case, 20 or 30 processors in the cloud might need to be used to replace this single device, driving the cost of the cloud deployment considerably higher than expected. Specialized hardware drives purchasing and maintenance costs higher, but it drives down the cost of actually processing data; public clouds often simply replace specialized hardware with a large number of more generic resources, shifting costs in unexpected ways.
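To see how this cost shift can play out, the sketch below compares a hypothetical ASIC-based device, amortized as CAPEX over its service life, against the monthly cost of the generic cloud instances needed to replace its forwarding capacity. Every figure here is invented for illustration; real pricing and the true replacement ratio vary widely.

```python
# Hypothetical comparison: one ASIC-based fabric router amortized over
# 5 years versus generic cloud instances matching its capacity.
# All numbers are invented for illustration only.
asic_purchase = 60_000             # one-time CAPEX for the device
asic_monthly = asic_purchase / 60  # amortized over 60 months

instances_needed = 25              # generic CPUs to match one ASIC
instance_monthly = 180             # per-instance cloud cost per month
cloud_monthly = instances_needed * instance_monthly

print(asic_monthly, cloud_monthly)  # 1000.0 4500
```

With these (invented) numbers, the "cheap generic compute" replacement costs several times the amortized specialized hardware; the lesson is not that one option always wins, but that the comparison must be made per workload rather than assumed.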
There is a common perception in network engineering that unused features are silent and neutral to the operation of the network—so long as a feature is not configured, it is doing no harm. The reality, however, is far different. Each feature available in a network device, or a cloud-based service, represents some amount of code—code that must interact with the code providing other configured, in-use features. These features, and the code they represent, are perfect gateways into failures through unintended consequences, potential security holes in waiting, and a larger attack surface. This problem is not unique to cloud services, of course; every vendor will add a constant stream of features, most of which any particular customer will not use. This does not mean these features have no effect on the performance or stability of the product you are using, however.
Operational tradeoffs are not the only area to consider when trying to understand the full cost of moving to public clouds for processing; there are also business tradeoffs to study.
Taking full advantage of cloud computing requires a business to rethink its operational and business processes. First, applications that businesses have built on traditional infrastructure solutions may require significant redesign to maximize their efficiency in a cloud computing environment. Second, operational processes that businesses have built around traditional computing infrastructure will need to be updated to support cloud computing.
Shifts in operational processes and application architectures are significant changes that some businesses have avoided due to their inherent costs. These businesses have tried to replicate as closely as possible their traditional infrastructure architecture and operations, merely replacing their own hardware with a reflection cast in the public cloud mirror. This approach is analogous to “fitting a square peg into a round hole.” It is possible to take this approach when the square peg is motivated to fit into the round hole with a sufficiently sized hammer, but the result is inelegant and inefficient.
This points to a larger problem many businesses do not consider when outsourcing: the mismatch between the goals of the outsourcer and the goals of the business itself. The goal of the outsourcing business is to produce a product or service that consumers want to purchase; the goal of the outsourcer is to produce a product that the business will consume as much as possible of, at the highest possible margin. It is quite possible for the outsourcer to drive internal business decisions in a direction that is not good for the outsourcing business in order to increase the outsourcer’s revenue and margins.
For instance, it is common for cloud providers (and all other vendors; public cloud providers are not unique in this regard) to add new features and functions they can use to leverage their customers into paying more, whether or not doing so actually improves the customer’s business, and to lock the customer in to the cloud provider’s product line.
The vendor lock-in problem is particularly acute in most business environments. When a business commits to using a specific cloud vendor, that business’s operational processes become locked into how a specific vendor delivers its technology. Moving to a different vendor becomes hard, because the target vendor probably delivers its technology differently, even if the technology in question is essentially the same service.
From a networking perspective, cloud computing presents nothing new in the context of vendor lock-in. For decades, networking vendors have delivered products with limited differentiation in functionality, but via widely different consumption models. Sometimes the underlying technology is different while delivering the same result. Other times, the technology is identical, based on industry standards, but configured in unique ways. And yet other times, vendors offer truly differentiated services unavailable from anyone else.
The networking services available in cloud computing don’t break the established paradigm. All vendors offer some baseline of services, but these services can be consumed uniquely. Some might offer special features to set their product apart. The challenge for network engineers is no different than it ever has been, requiring careful comprehension of the technology’s capabilities and applicability to a business’ problems.
The risk of the all-consuming cloud provider: Some cloud providers have, in the past, used a partnership with a customer to learn how to build and support a particular business model, and then used the experience to enter the market as a direct competitor to their own customer. Providing services for unique businesses can be a great incubation strategy for cloud providers to spin up internal analogs to the customers they are supporting, eventually broadening their market reach.
For the network engineer, cloud computing presents the challenge of providing low-latency, secure connectivity over a mix of public and private transports using a mix of physical and virtual equipment. In addition, this marvelous cloud-based transport service must also be provisioned and deprovisioned on demand in real time as workloads are stood up and torn down, consumed programmatically, and monitored centrally.
When you are considering how applications are deployed in cloud environments, workload placement becomes especially interesting. Assume an enterprise is deploying an application in a multicloud environment. In this scenario, workloads can be placed in one or more public clouds, as well as in a private cloud.
Developers often break a single application up into microservices, where each component of the application is separated out into a standalone service. The application is then reconstituted as a set of services communicating with one another across the network to support the same overall set of services as the original application.
The problem that microservices architectures face is latency. When communicating over distances measured in kilometers rather than meters, the time it takes packets to traverse the distance is measured in milliseconds instead of submilliseconds; it takes two such trips across the network, the Round Trip Time (RTT), to complete any transaction between the microservices making up an application. Since multiple microservices must interact to produce the same amount of data as the original, monolithic application, these delays “stack up” to produce a total delay much greater than many developers expect. Figure 28-2 illustrates.
In Figure 28-2, A requests some information from the monolithic application; the RTT across the network for processing the request and returning the information is 20ms. When B requests this same information, service 1 must request information from service 2, which must request information from service 3, etc. The total network time in the microservices case is 80ms. If there is any increase in the delay across the network for any reason, the effect is multiplied by four in the microservices case, because there are four RTTs involved in every service request.
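The RTT stacking described above is simple multiplication, which can be sketched directly. The 20ms figure and the four-hop chain come from the text's example; the function itself is an illustration, not part of the original.

```python
# Total network time when a request fans out through chained microservices.
# Assumes, as in the text's example, each chained service call costs one
# full RTT, and the monolithic case costs a single RTT.

def total_network_time_ms(rtt_ms: float, service_hops: int) -> float:
    """Each chained service call adds one round trip of rtt_ms."""
    return rtt_ms * service_hops

monolith = total_network_time_ms(20, 1)  # one RTT: client -> app -> client
chained = total_network_time_ms(20, 4)   # A -> s1 -> s2 -> s3 and back

print(monolith)  # 20
print(chained)   # 80

# Any added per-RTT delay is multiplied by the number of hops: a 5ms
# increase in RTT costs 20ms of extra total delay in the chained case.
print(total_network_time_ms(25, 4))  # 100
```

The multiplier is exactly why distributed placement decisions deserve a network engineer's scrutiny: delay that looks negligible per hop compounds per transaction.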
In applications previously deployed on traditional infrastructure or fully contained in a localized private cloud, latency is far less of a concern. However, applications composed of many components, such as microservices, and spread over a variety of clouds can experience reduced performance due to latency.
For network engineers faced with this problem, at least two solutions present themselves.
Work with application deployment teams to optimize workload placement. Workloads are placed in specific clouds for a variety of reasons, including capacity, cost, and functionality. For network engineers, the key is to be involved in the application design so decisions about workload placement include a clear understanding of the infrastructure implications of those placement choices. Application developers in conjunction with infrastructure engineers and business stakeholders should make design decisions jointly.
Every design is a compromise between technical idealism and practical pragmatism. For example, network latency might be an acceptable compromise for a particular design, because the overall application performance is not impacted materially enough. On the other hand, complex applications deployed in ignorance of infrastructure realities might suffer unacceptable performance compromises.
Bring clouds closer together. Many data centers offer cloud exchange services, where customers can purchase direct links to public cloud providers, often through a cloud exchange. This means a network engineer can minimize the impact of latency by designing the network to bring clouds closer together.
These services come at a cost and require a purposeful routing design. A common challenge in standing up direct connections to public clouds is that the IP address blocks in question are accessible both via the public Internet and now via the newly introduced cloud exchange circuit. Routing tables must be populated so traffic is forwarded via the cloud exchange, while also avoiding asymmetric routing.
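The routing requirement above boils down to longest-prefix match: the cloud provider's address block is reachable both via the default route and via the more specific prefix learned over the cloud exchange. A minimal sketch, with a hypothetical routing table and an example prefix invented for illustration:

```python
import ipaddress

# Hypothetical routing table: the provider's block is reachable via the
# default route (public Internet) and via a more specific prefix learned
# over the cloud exchange circuit. The prefix values are illustrative.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "internet"),          # default route
    (ipaddress.ip_network("52.94.0.0/16"), "cloud-exchange"), # example block
]

def next_hop(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    matches = [(net, via) for net, via in routes if addr in net]
    # Longest-prefix match: the most specific covering route wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("52.94.1.10"))  # cloud-exchange
print(next_hop("8.8.8.8"))     # internet
```

Because the more specific prefix always wins, traffic to the provider deterministically uses the exchange; the asymmetry concern in the text arises when the return path from the cloud does not mirror this preference.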
When you are moving an existing application to public cloud IaaS, a second problem comes in the form of storage. How is the application data living in a local data center moved into the public cloud so the application has access to the data in the new environment?
For network engineers, this type of challenge is not a new one. Moving large amounts of data from one point to another separated by distance is a problem of constraints. First, the amount of bandwidth between two geographically diverse points is typically limited to a fraction of the bandwidth available in a data center. Second, latency can make it difficult to use the entirety of the bandwidth available to execute the transfer.
In a local area network (LAN), circuits are very high bandwidth, commonly interconnecting hosts to the network at speeds of 10, 25, 40, 50, and even 100Gbps. In the LAN scenario, bandwidth generally is not a constraint when moving storage data around the network. Bottlenecks in the transfer process are more likely to be found in the disk or host bus subsystems.
However, when the storage transfer is happening over a wide area network (WAN) such as the public Internet, bandwidth often becomes a constraint, as the bottleneck moves from host data bus or disk itself back to the network. Circuits interconnecting private and public clouds are very often less than 10Gbps. In addition, the connection might be lossy when compared to a LAN, requiring retransmissions and reducing overall throughput. This is one element network engineers must consider when computing how long it will take to move a storage volume from the local data center to the public cloud.
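The transfer-time computation the text mentions is worth making concrete. The volume size, circuit speed, and efficiency factor below are invented for illustration; the efficiency factor stands in for the protocol overhead, loss, and retransmissions discussed above.

```python
# Back-of-the-envelope estimate of how long it takes to move a storage
# volume across a link. All figures are illustrative assumptions.

def transfer_hours(volume_tb: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """efficiency models protocol overhead, loss, and retransmissions."""
    bits = volume_tb * 8 * 10**12              # decimal TB -> bits
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600

# 10 TB over a 1 Gbps Internet circuit at 70% effective utilization:
print(round(transfer_hours(10, 1.0), 1))   # 31.7 hours

# The same copy over a 40 Gbps LAN, where the network is rarely the
# bottleneck (disk or host bus limits would likely dominate instead):
print(round(transfer_hours(10, 40.0), 1))  # 0.8 hours
```

Numbers like these feed directly into the network-versus-shipped-media decision discussed next.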
In addition to bandwidth constraints, latency has historically been a potential constraint. Assuming the Transmission Control Protocol (TCP) is the transfer mechanism, the amount of time waiting for an acknowledgment across the WAN means it might be difficult to fill the available bandwidth. This is a well-known issue for high-bandwidth, high-delay networks—so-called long fat networks (LFNs).
However, the challenge of fully utilizing the available bandwidth of LFNs has been addressed with several tuning techniques and variants to the TCP protocol. For instance, BIC-TCP, TCP Westwood, TCP Reno (with several variants), TCP Hybla, and TCP Vegas are all algorithmic variants of the core TCP congestion control algorithm, modifying window size in relation to round trip time to maximize throughput. Also notable, CUBIC TCP has seen recent attention in the IETF.
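The LFN problem reduces to the bandwidth-delay product: TCP can have at most one window of unacknowledged data in flight, so throughput is bounded by window size divided by RTT. A minimal sketch, with example link figures chosen for illustration:

```python
# Bandwidth-delay product (BDP): how much data must be in flight to keep
# a long fat network full, and the throughput ceiling a fixed TCP window
# imposes. Link speed and RTT values are illustrative.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes in flight needed to fill the pipe for one RTT."""
    return bandwidth_bps * rtt_s / 8

def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling: one window delivered per round trip."""
    return window_bytes * 8 / rtt_s

# A 1 Gbps path with 80 ms RTT needs a 10 MB window to stay full:
print(bdp_bytes(1e9, 0.080))                 # 10,000,000 bytes

# A classic 64 KB window on that same path caps throughput at ~6.6 Mbps,
# a tiny fraction of the available bandwidth:
print(max_throughput_bps(64 * 1024, 0.080))
```

Window scaling and the congestion-control variants named above exist precisely to close this gap between window-limited throughput and the link rate.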
The point to keep in mind is that populating a remote storage volume with terabytes of data via a copy operation across the public Internet will take more time than a comparable copy performed locally. This introduces a decision point. Is the performance sufficient so the copy can be done via network transfer? Or should data be copied onto local, portable media and then shipped to the remote public cloud?
In a situation like this, there is no magic available to make terabytes of data in one place appear in another instantly. As such, this problem is a good example of understanding the practical limitations of the available technology and working with the business to determine the proper course of action.
Once data has been populated in the remote cloud storage, moving data back out of the cloud presents challenges. One issue is a practical one: cost. While public cloud providers are keenly interested in their customers checking data in, they don’t want those customers to leave. Thus, public cloud providers charge as much as three to five times the ingestion transfer costs to move data back out. This is commonly known as the data gravity problem.
Data gravity is not a networking concern, but rather a business problem that network engineers should be aware of. For network engineers focused on the technology challenge, moving large amounts of storage data out of a public cloud presents the same challenges as moving the data into the cloud in the first place. Limited bandwidth and latency introduce constraints that might increase transfer times to a degree unacceptable to the business.
While some organizations will connect to public cloud services using cloud exchanges, most organizations will connect to the public cloud via the Internet. Internet circuits have come down in price, making multiple Internet connections at the network edge affordable. This offers network engineers an interesting network design option. Rather than a single Internet connection at the edge, multiple connections offer both resiliency and additional bandwidth.
The challenge is how, exactly, to leverage multiple Internet edge circuits? The straightforward and obvious answer is via a routing protocol. In the case of the Internet edge, the routing protocol is BGP. However, while BGP enables the use of multiple Internet connections, BGP’s best path algorithm is focused on connectivity and not quality of application experience. BGP can only distinguish the relative closeness of one path versus another, and not whether a longer path might be better quality.
Since BGP is insufficiently nuanced to make optimal routing decisions at a per-application level, a market niche known as Software-Defined WAN (SD-WAN) has taken recent hold in the industry. SD-WAN solutions are typically proprietary forwarding schemes concocted by vendors. SD-WAN forwarding schemes prioritize quality of experience (QoE) for specific applications, and make forwarding decisions based on the QoE policy defined by a network engineer.
In the case of accessing the public cloud, an SD-WAN forwarding scheme will determine the best Internet circuit to use to provide the best service to the cloud consumer. For example, an SD-WAN forwarder might (allegedly) determine Internet circuit A is best to access the Microsoft Office 365 SaaS cloud, while Internet circuit B is best for Amazon Web Services IaaS hosted workloads.
Although the details are unique to each of the many SD-WAN vendors offering products in this space, making a forwarding decision about what is best might include the following decision points:
1. Circuit lossiness. Is a circuit dropping packets? If so, to what degree? Loss will be more acceptable to some traffic, such as large file transfers, where recovery ensures data integrity. Loss will be unacceptable to traffic such as real-time voice, where a conversation will be impacted.
2. Circuit jitter. Is a circuit delivering packets on predictable time intervals? Like loss, jitter—a variance in the time delta between packet deliveries—is acceptable or not, depending on the packet payload.
3. Circuit load. How busy is a given circuit? SD-WAN solutions can choose to send traffic over a less loaded circuit to improve QoE for the traffic.
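The three decision points above can be sketched as a weighted scoring function over measured circuit metrics. The thresholds, weights, and metric values below are invented for illustration; real SD-WAN policy engines are proprietary and considerably more sophisticated.

```python
# Toy circuit selection: score each circuit on measured loss, jitter, and
# load, weighted per application class. All numbers are illustrative.

CIRCUITS = {
    "internet-a": {"loss_pct": 0.1, "jitter_ms": 2.0, "load_pct": 60},
    "internet-b": {"loss_pct": 1.5, "jitter_ms": 12.0, "load_pct": 20},
}

# Per-application weights: voice heavily penalizes loss and jitter, while
# a bulk transfer mostly cares about available capacity (low load).
POLICY = {
    "voice": {"loss_pct": 50.0, "jitter_ms": 5.0, "load_pct": 0.2},
    "bulk": {"loss_pct": 2.0, "jitter_ms": 0.1, "load_pct": 1.0},
}

def best_circuit(app: str) -> str:
    weights = POLICY[app]
    def penalty(metrics: dict) -> float:
        # Lower penalty means better QoE for this application class.
        return sum(weights[k] * metrics[k] for k in weights)
    return min(CIRCUITS, key=lambda name: penalty(CIRCUITS[name]))

print(best_circuit("voice"))  # internet-a (clean, but busier)
print(best_circuit("bulk"))   # internet-b (lossy, but lightly loaded)
```

The point of the sketch is that two applications can rationally prefer different circuits at the same moment, which is exactly the per-application nuance BGP's best path algorithm cannot express.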
SD-WAN products take the routing design and administration out of the hands of the network engineer or the routing protocol, moving those concerns to software. For connectivity to public cloud, this means the end user QoE is optimized constantly, without the network engineer having to make unusual tweaks to the routing system. This approach has the added benefit of being able to add and remove Internet edge circuits to the scheme at will with a minimum of engineering.
The downside of SD-WAN solutions is they are proprietary. While there have been some very early conversations in the networking industry about making SD-WAN solutions interoperable, the market is too nascent and unstable to have seen progress in SD-WAN standardization. The market is focused instead on product consolidation and customer growth.
With security breaches a regular part of the news cycle, the conversation of properly securing the public cloud becomes poignantly interesting. For network engineers, there are several concerns worth discussing:
1. Protecting data over public transport
2. Managing secure connections between cloud environments
3. Isolating data in multitenant environments
4. Understanding role-based access controls (RBAC) in cloud environments
In a LAN, whether data should be encrypted or not is an open question. When data is being moved between two trusted endpoints across a wholly owned LAN, is there any security advantage in encrypting the data? The answer depends greatly on several factors:
1. The nature of the data. For example, health data and credit card data contain highly sensitive information. The data should be encrypted in all circumstances, but also might have to be encrypted for regulatory reasons.
2. How trust is defined in an organization. The idea of a hardened network perimeter where a trusted network resides on one side and an untrusted network on the other is largely historical. While there is an inherent feeling of trust or comfort borne of familiarity, network hosts are not trustworthy just because they are part of infrastructure owned by an organization. In the modern era, malware infections are assumed, meaning all hosts on a network need to be looked at as threats. In the context of network transport, this means any host on a network should be viewed as a possible point of gathering packets. Assuming the host can see every packet on the wire, what can be done to prevent the packet’s payload from being interesting to the malware-infected host?
3. Whether the data is already encrypted or not. In an application stack, the data could be encrypted in several ways. One of those ways is at an application level, where the client and server negotiate an encryption scheme to be used to obfuscate the data payload. For instance, Secure Hypertext Transfer Protocol (HTTPS) is HTTP over Transport Layer Security (TLS). In the presence of HTTPS, does it make sense to encrypt the traffic again with the lower-level Internet Protocol Security (IPsec) protocol used by network engineers to secure point-to-point links?
When considering the public cloud, these questions are all relevant but have a different context. For instance, most connections to public cloud services are over the public Internet. The public Internet is normally considered an untrusted transport.
While encryption might not be required, it is a best common practice to always encrypt data traveling over an untrusted transport. The encryption might be via HTTPS, which is not a concern for network engineers, as it is happening at the application level. For network engineers, the primary encryption concern will be for connecting cloud environments together.
IPsec is the most common technology used to interconnect cloud environments. IPsec offers the benefit of a tunnel mode as well as strong encryption. This means network engineers can connect an AWS Virtual Private Cloud (VPC) to a local data center across the Internet. The AWS VPC network can be treated as a network like any other network connected to the organization, using the IPsec tunnel as a WAN link.
IPsec tunnels can also be used to connect not only private and public cloud environments together, but also public clouds to public clouds. This means a workload in one public cloud could query a workload in a different public cloud with an encrypted payload via the public Internet.
Note that encryption and security are not synonymous. While encryption is one part of a security infrastructure, encryption by itself does not imply a secure network or application. Additional security elements that might be required for an application to be considered secure include authentication, input sanitization, access control lists, a backup and recovery scheme, and deep packet inspection.
A significant challenge of IPsec is managing the connections. IPsec configuration is complex, requiring deep engineering knowledge. Maintaining the Virtual Private Network (VPN) once the IPsec tunnels have been created is an ongoing task to ensure required tunnels stay up, old tunnels are torn down when they are no longer needed, and new tunnels are built when appropriate.
IPsec endpoints are also notoriously difficult to connect if the vendors vary. IPsec is a standard, but there is enough flexibility in the standard to make the creation and maintenance of inter-vendor IPsec tunnels a frustrating experience.
In public cloud networking, IPsec tunnels are relied upon to interconnect environments, but the variety of ways in which this can be done is fraught with management headaches. To ease this burden, a market has opened for vendors to manage IPsec tunnels via a centralized management tool. In this scenario, the tool is aware of the multiple clouds an organization is using. The network engineer uses the tool to select different clouds to be interconnected. The tool takes care of the IPsec details, creating and maintaining the tunnel between environments.
Another concern some raise about public cloud is that public clouds are multitenant environments. The compute infrastructure, including data, of one organization is hosted in a public cloud right alongside the compute infrastructure of another. How are these compute environments separated or compartmentalized? Is there a chance some tenant could gain access to another tenant’s data because they are sharing public cloud infrastructure?
The short answer to this concern is the risk is not generally considered significant. Multitenancy is well understood in computing and networking. Virtualization is the critical technology employed to allow multiple tenants to share common hardware resources.
In addition, public cloud providers often demonstrate compliance with critical security standards, allowing their infrastructure to be used for sensitive transactions. For instance, both AWS and Microsoft Azure are PCI-DSS Level 1 Service Providers, of interest to those processing payments. PCI-DSS is just the tip of the cloud compliance iceberg. Both Azure and AWS offer certifications for several compliance-related programs the world over, as well as support organizations aiding customers impacted by these regulations.
This is a roundabout way to make the point that multitenancy is not a concern for organizations wishing to consume the public cloud. Security offerings in the cloud are robust and nuanced, moving beyond simple tenant isolation and into compliance with complex regulations.
Public clouds also offer complex controls to limit what entities can access which resources in the public cloud. In networking, this is known as role-based access control (RBAC).
In networking, RBAC has been used to control what administrative tasks network engineers can perform on network equipment. In the public cloud, resources can be similarly controlled. For example, in AWS, the Identity and Access Management (IAM) service offers granular roles and permissions for a variety of public cloud resources. In addition, extensive documentation and training are available to properly leverage this complex resource.
Another challenge facing network engineers in the public cloud is packet capture and analysis. In wholly owned networks, access to the physical switches and wires carrying traffic means traffic can be copied from one port to another for capture, or intercepted via network taps. These copied packets flow across a visibility fabric—a collection of specialized network devices that gather, filter, and slice packets—to tools that perform packet analysis.
Networking in the public cloud presents a challenge for visibility fabrics, because there is no longer access to physical switches or wires from which to obtain copies of traffic. How can packets be captured when there is no physical network accessible?
This unique challenge is being handled by vendors via host interception. While the underlying network infrastructure of the public cloud is not accessible, the hosts running on the public cloud are. Those hosts are the virtualized workloads that public cloud consumers own and operate. Therefore, to capture traffic in the public cloud, copies of the packets are made on the virtual workload and tunneled to a tool that will perform the analysis.
The virtual workload runs an agent that facilitates the copy. The agent will also perform filtering, so not all packets are copied to the analysis tools. Copying all packets everywhere to analysis tools could overwhelm the network with excessive traffic, a pointless thing to do if just specific packets are required.
Cloud computing, for all the infrastructure complexity it masks, does not eliminate the requirement for thoughtful network design. Businesses sold on the notion that the difficulties of operating infrastructure go away because they have paid a friendly cloud provider are missing a crucial point. Leveraging cloud technologies well means a shift in skill sets, not an elimination of expertise.
Cloud computing can even introduce new problems in application performance if an appropriate design is overlooked. Network engineers who wish to add value to the organizations they support will benefit their organizations by offering designs to make the best of high-latency network links.
In addition, cloud computing necessitates all technology silos in an IT team working together. Network engineers have an opportunity to lead, as the transport between cloud environments is a point of commonality touching API calls between services, storage performance, high availability, and disaster recovery. A deep understanding of how the network enables or constrains communications informs the design of all these services.
Erl, Thomas, Ricardo Puttini, and Zaigham Mahmood. Cloud Computing: Concepts, Technology & Architecture. 1st edition. Upper Saddle River, NJ: Prentice Hall, 2013.
Hawramani, Ikram. Cloud Computing for Complete Beginners: Building and Scaling High-Performance Web Servers on the Amazon Cloud. 1st edition. Hawramani.com, 2016.
“PCI Compliance—Amazon Web Services (AWS).” Amazon Web Services, Inc. Accessed August 25, 2017. https://aws.amazon.com/compliance/pci-dss-level-1-faqs/.
Reed, Archie, and Stephen G. Bennett. Silver Clouds, Dark Linings: A Concise Guide to Cloud Computing. 1st edition. Prentice Hall, 2010.
Rhee, Injong, Lisong Xu, Sangtae Ha, Alexander Zimmermann, Lars Eggert, and Richard Scheffenegger. “CUBIC for Fast Long-Distance Networks.” Internet-Draft. Internet Engineering Task Force, July 2017. https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-cubic-05.
Ruparelia, Nayan B. Cloud Computing. Cambridge, MA: The MIT Press, 2016.
Weinberger, Matt. “Amazon Explains Its Secret Weapon in the Cloud Wars.” Business Insider. Accessed August 25, 2017. http://www.businessinsider.com/amazon-web-services-lambda-explained-2015-11.
1. This chapter states that moving from internally owned and managed resources to a public cloud service can move CAPEX to OPEX, and make costs more predictable. What does the predictability of cost rely on in a cloud service?
2. This chapter states that feature creep in a cloud service can cause nightmares. Compare the use of proprietary features in vendor-provided network equipment to the use of proprietary features in public cloud services. How are they different or the same?
3. Explain why latency and jitter would be issues to consider when moving processing to a public cloud service.
4. Research the concept of data gravity. What are other meanings for this term, and the problems it represents, which are not covered in the text?
5. Why is selecting the best route into and out of cloud services important?
6. There are many cloud security issues not considered in the chapter, such as cross processor memory attacks, data breaches, and providing confidentiality against the cloud provider. Choose one of these problems, describe the problem, and describe at least one solution to the problem (if there is one available).
The world is made up of things. Television sets, radios, light bulbs, and refrigerators all surround each of us every day, providing essential services in some cases, and simply making our lives simpler in others. The Internet of Things (IoT) either accepts the reality or proposes to make real (depending on your perspective) the connection of every one of these devices to the Internet. Connecting this many “things” to the Internet, however, requires a radical rethinking of the systems required to collect and act on data; the sheer amount of data would require the kinds of newer design patterns outlined in Chapter 25, “Disaggregation, Hyperconvergence, and the Changing Network,” and rely on the kinds of service separation and scaling considered in Chapter 27, “Virtualized Network Functions.”
This chapter considers IoT from various perspectives.
On Tuesday, September 13, 2016, the security analysis website KrebsOnSecurity.com was down.1 More accurately, the site was rendered inaccessible by a distributed denial of service (DDoS) attack. DDoS attacks take advantage of the free and open nature of the Internet. Since any part of the Internet can talk to any other part, it becomes possible to launch attacks from many Internet locations at once.
The distributed nature of a DDoS attack makes the attack difficult to mitigate. Which source addresses should be filtered as attackers, and which source addresses are those of legitimate customers? The attack patterns are purposely designed to make this challenging to discern, overwhelming the target with traffic, and resulting in service requests being denied. The overwhelming amount of traffic in this case was estimated to be as high as 620Gbps.
Brian Krebs, the security expert behind KrebsOnSecurity.com, describes himself as obsessed with security, maintaining relationships with many other smart information technology (IT) experts to keep his skills honed and his writing deeply informed. Yet, even with his technical prowess, his site fell victim to this DDoS attack. Why?
On Friday, October 21, 2016, the Domain Name Services (DNS) provided by Dyn came under a DDoS attack. This attack was even more nefarious than the one on Krebs. The attack impacted Dyn and Dyn’s customers, rendering their services effectively offline. If name resolvers were unable to reach Dyn DNS servers, the domain names hosted by Dyn could not be resolved. Various media outlets reported impacts to AirBnB, Amazon Web Services, Box, FreshBooks, GitHub, Netflix, PayPal, Reddit, Spotify, and Twitter, just to name a few.
In the DDoS attack launched against Dyn, an estimated 40,000–100,000 sources generated an aggregated peak traffic of a staggering 1.2Tbps. The attack came in waves, taking services down, back up, and down again while mitigation efforts were deployed.
What do these attacks have to do with the Internet of Things (IoT)? In both the Krebs and Dyn attacks, poorly secured IoT devices were leveraged. DDoS attacks often work via botnets. In a botnet, many Internet-connected computing devices are compromised due to some security flaw. When the flaw is exploited, command-and-control software is installed, bringing the device under the control of a remote party.
When enough devices are controlled, they can be used by the controller to launch a coordinated attack against a target. The Internet is used as the network to carry out the attack. IoT is making it easier to create botnets and launch powerful DDoS attacks, such as the ones against Krebs on Security, Dyn, and many Dyn customers.
In fairness, IoT is not specifically to blame. The issue is more one of a huge number of devices connected to an intentionally open Internet, which are not often touched, and rarely have the processing power to dedicate a lot of effort to security. After all, IoT is merely a handy term to describe the notion of a world in which unexpected things are connected. Home automation smart devices such as thermostats, garage door openers, refrigerators, lights, video surveillance, locks, and home entertainment devices, not to mention cars, are part of the brave new IoT world.
In the realm of business, smart cities can control parking, optimize traffic, and micromanage electrical power distribution. Smart buildings can optimize environmental systems, managing heating, cooling, and lighting in harmony with physical building design and efficiency protocols. Smart factories monitor manufacturing processes, rooting out the tiniest inadequacies in production, squelching problems before they become manufacturing defects.
Energy producers also add smart devices to the IoT panoply, relying on sensors to govern oil and gas production, as well as the operation of wind farms generating electricity.
As the usefulness of IoT has exploded, IoT device manufacturers have focused on functionality more than security. Far too many IoT devices are shipping with easily defeated security measures, making them easy targets to add to a botnet for use in a later attack.
Herein lies one of several networking challenges introduced by the Internet of Things considered in this chapter:
1. IoT security. How should IoT devices be secured? What is unique about them compared to traditional compute? What design constraints are introduced by IoT security peculiarities?
2. IoT connectivity. As IoT devices flourish in number, what strategies are required by network engineers to best connect them to the local networks they serve as well as the Internet? This is a more poignant consideration than it sounds, complicating both addressing schemes and communications protocols.
3. IoT data. The amount of data produced by IoT devices in certain applications places a burden on the Internet that network engineers must consider. For IoT data to be made use of in a timely way, it must be processed as quickly as possible. The latency of public cloud introduces an IoT data processing challenge the network engineer must consider.
Some writers have characterized the IoT as the Internet of Terror;2 others, The Internet of Stupid Things.3 Why the derision? The story of Krebs on Security could be the story of every website unless some method is used to secure these “smart” devices now proliferating. The question is: How can you secure devices that have very little processing power, very little memory, and generally cannot or will not be updated on a regular basis?
There are several possible answers to this question, though many security and network engineers believe that none of them ultimately proves sufficient.
One obvious thought to secure IoT devices is to treat them as you’d treat any computing device: lock them down. There are well-known processes to minimize the attack surfaces of Linux and Windows operating systems. For example, a Windows 10 workstation might have a specific policy pushed to it through centralized control. Or, a Linux server might be instantiated using a template predefined by operators to turn off unused services, reducing the attack surface.
However, IoT devices are not running full-blown operating systems and might lack the tools required to secure them in this way. In addition, securing them might break some of their functionality.
Since the IoT devices themselves cannot be secured, controlling access to the network is currently the main strategy for IoT device security. The idea is to allow the IoT device access to the network, but to strictly limit what the IoT device can reach through the network. While the Internet is an open transport, private networks containing IoT devices are not presumed to be open. Operators have the opportunity to control traffic flows and limit the chance their IoT devices become compromised. In addition, operators can prevent their IoT devices from being used as minions in a DDoS attack, even if they are compromised.
IoT device access control can be accomplished in a couple of different ways; the following sections consider both service-based isolation and endpoint isolation.
In the IoT service-based isolation model, IoT devices with a common purpose are assigned to a specific network segment that is isolated from the rest of the network by a security service. The security service implements a policy filtering traffic flowing into the IoT network from the outside world. The security policy also limits what the IoT devices can access beyond the service, as illustrated in Figure 29-1.
Figure 29-1 Using a Security Service to Separate IoT Devices from the Rest of the Network
This strategy can work in scenarios such as building environmental control. In this model, IoT devices such as thermostats and HVAC controls can participate on a common network. In fact, they might need to communicate on a common network to share data with one another or with a centralized controller. The HVAC environmental network is isolated from all other building networks by the programmed behavior of the security appliance or service policy.
Exceptions to this policy can be made as needed to support business functions. For instance, IT operations might require access to support HVAC functions and monitor equipment. Workstations supporting building maintenance might require access to the network as well. The security policy governs this traffic, limiting packet flows to what is absolutely required.
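The default-deny policy with narrow business exceptions described above can be sketched in a few lines. This is a hypothetical illustration, not a real firewall configuration; the segment prefixes, source networks, and port numbers are all invented for the example.

```python
import ipaddress

# Hypothetical service-based isolation policy: traffic into the IoT
# (HVAC) segment is denied by default; only required business
# functions (IT operations, building maintenance) are permitted.
HVAC_SEGMENT = ipaddress.ip_network("10.20.0.0/24")   # IoT thermostats/HVAC
ALLOWED_SOURCES = [
    ipaddress.ip_network("10.1.5.0/28"),  # IT operations jump hosts
    ipaddress.ip_network("10.1.9.7/32"),  # building-maintenance workstation
]
ALLOWED_PORTS = {443, 8443}               # management traffic only

def flow_permitted(src_ip: str, dst_ip: str, dst_port: int) -> bool:
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if dst not in HVAC_SEGMENT:
        return True   # this policy only governs traffic into the IoT segment
    return any(src in net for net in ALLOWED_SOURCES) and dst_port in ALLOWED_PORTS

print(flow_permitted("10.1.5.3", "10.20.0.17", 443))   # IT ops host: permitted
print(flow_permitted("192.0.2.9", "10.20.0.17", 443))  # outside host: denied
```

In a real deployment this logic would live in the security appliance or service in front of the segment, but the shape of the policy is the same: a short whitelist of exceptions in front of a default deny.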
The result of this strategy is that IoT devices become much more difficult to access remotely. The IoT devices themselves are still insecure, but their attack surfaces have been isolated to expose them to as few outside hosts as possible while still allowing them to perform their functions.
Endpoint isolation takes the idea of appliance-based isolation one step further. In this approach, every IoT device on the network is isolated from every other device on the network. This is managed at the ingress network port. Where the IoT device is plugged into the network, there is a filter in place strictly limiting what other systems the device can communicate with, and vice versa.
The problem with endpoint isolation is one of administrative burden. Maintaining appropriate whitelists for every IoT endpoint on the network is tedious at best and an impossible-to-scale challenge for IT operations teams at worst.
To handle this challenge, some IoT security solutions relying on endpoint isolation employ central management. In this scenario, an administrator creates security policies, assigning them to groups. Then, IoT devices are placed into the proper groups. The central controller takes care of pushing the appropriate traffic filtering policy to the ingress network port, isolating the IoT device. Figure 29-2 illustrates.
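The group-based workflow just described can be sketched as follows. The group names, device names, and addresses are invented for illustration; the point is that the administrator maintains only the policy-per-group and device-to-group mappings, and the controller derives the per-port filter.

```python
# Hypothetical sketch of centrally managed endpoint isolation: the
# administrator defines a policy per group and assigns devices to
# groups; the controller renders the filter pushed to each device's
# ingress network port.
GROUP_POLICIES = {
    "infusion-pumps": {"allowed_destinations": ["10.30.0.10", "10.30.0.11"]},
    "hvac-sensors":   {"allowed_destinations": ["10.20.0.5"]},
}
DEVICE_GROUPS = {
    "pump-ward3-01": "infusion-pumps",
    "thermostat-f2": "hvac-sensors",
}

def render_port_filter(device: str) -> dict:
    """Return the filter the controller would push to the device's ingress port."""
    group = DEVICE_GROUPS[device]
    return {
        "device": device,
        "permit": GROUP_POLICIES[group]["allowed_destinations"],
        "default": "deny",   # anything not whitelisted is dropped
    }

print(render_port_filter("pump-ward3-01"))
```

The scaling win is that adding the thousandth infusion pump is one new entry in `DEVICE_GROUPS`, not a thousandth hand-maintained whitelist.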
Endpoint isolation is an effective approach for IoT devices that operate as loners. For example, endpoint isolation works effectively in healthcare, where medical devices might require access to just a small number of other systems on the network, and do not participate as a member of a collaborative sensor group.
One technology effort that could significantly impact IoT security is unikernels. Unikernels are stripped down versions of operating systems that include just the functionality strictly required for the application they support. This approach dramatically reduces the attack surface of the underlying system.
What is meant by “attack surface” in the context of IoT devices? Operating systems often ship with many libraries and supporting applications by default, selected at the whim of the operating system (OS) distribution creators. Operating systems are bundled in this way because the resources are known to be commonly used, even if they are not always used. Attackers will attempt to leverage any resources they can, hoping to discover vulnerable code when they run their exploits.
For example, an operating system might include an antiquated storage library, included by the distro makers just in case a user is running the OS on an older system. If an old storage library is not required to access any of the disk hardware in the system, it is both useless to the operator and potentially exploitable by attackers. The library makes up part of the “surface” that can be attacked.
Thus, on IoT devices shipped with full operating system distributions, a larger than necessary attack surface is present. Unikernels strip out the daemons, libraries, and applications not required by the operating system nor the application. The result is a barebones, highly efficient operating system environment containing just what is needed to support the applications running.
Don’t think of “barebones” negatively in this context. Operating systems function in multiple contexts. Some are used actively by end users—for example, as desktop operating systems, programming environments, digital media creation, and so on. In this context, a barebones unikernel would be an overly constrained platform likely to inconvenience the user to the point of madness.
However, in the context of a specialty device manufactured with a singular purpose, a unikernel seems perfectly suited. An operating system built for the purpose of delivering a single application, most likely on a customized bit of hardware that will never change, cries out for a secure and efficient environment.
Process impact aside, there is little conceptual downside to an IoT device manufacturer taking the unikernel approach. Despite the advantages, unikernels have not been widely adopted by IoT device manufacturers yet.
Unikernels are raised here as a security enhancement for IoT devices, but how many consumers of IoT devices will understand this point and make their purchases accordingly? Perhaps consumers, armed with the knowledge of unikernels and sufficient buying power, could demand manufacturers properly implement unikernels as the base platform their applications run on.
This chapter was introduced with the notion of an open Internet, the intentionally open nature of the Internet that has made possible free communication as well as nefarious attacks. Although supportive of the open Internet, the Internet Engineering Task Force (IETF) offers perspectives on best practices, codified as Best Current Practice (BCP) documents.
One such BCP of interest to the IoT discussion is BCP38, which is a pointer to RFC2827: Network Ingress Filtering: Defeating Denial of Service Attacks Which Employ Internet Protocol (IP) Source Address Spoofing. BCP38 proposes all network operators filter traffic with inappropriate source addresses. “Inappropriate” means source addresses that should not be used to originate packets or should not appear on the wire on the interface where they were received.
For example, RFC1918 specifies address blocks for private use only:
• 10.0.0.0/8
• 172.16.0.0/12
• 192.168.0.0/16
RFC1918 blocks are not routable across the public Internet. Traffic containing source addresses from RFC1918 address space should never appear on the public Internet. IPv6 also has several kinds of globally unroutable addresses, such as link local addresses, and unroutable globally unique addresses.5 Therefore, BCP38 (RFC2827, updated by RFC3704) suggests they should be filtered.
DDoS attacks often use spoofed—fake—source addresses in their attack. The attackers don’t require a response, and obfuscating source addresses makes it more difficult to track down the actual hosts propagating the attack. RFC1918 addresses are useful here, but any addresses could be, and are, used in DDoS attacks.
By dropping traffic with spoofed addresses, DDoS attacks should be at least partially mitigated. Writing filter lists that drop address blocks such as RFC1918 is straightforward. Also straightforward is the filtering of bogons, containing nonroutable address blocks, plus unassigned public address blocks.8 Unicast Reverse Path Forwarding (uRPF) ensures traffic is only forwarded if the source address is reachable through the interface through which the packet was received.
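A minimal sketch of the filter-list half of this, using Python's standard `ipaddress` module: drop any packet whose source address falls in never-routable space. Real bogon lists are much longer and change as address space is allocated, so this is an illustration of the mechanism, not a complete list.

```python
import ipaddress

# Sketch of BCP38-style ingress filtering: reject packets whose source
# address could not legitimately originate traffic on the public
# Internet. (uRPF, by contrast, is a dynamic check against the
# forwarding table rather than a static list like this one.)
BOGON_SOURCES = [
    ipaddress.ip_network("10.0.0.0/8"),       # RFC1918
    ipaddress.ip_network("172.16.0.0/12"),    # RFC1918
    ipaddress.ip_network("192.168.0.0/16"),   # RFC1918
    ipaddress.ip_network("127.0.0.0/8"),      # loopback
    ipaddress.ip_network("169.254.0.0/16"),   # link local
]

def permit_ingress(src_ip: str) -> bool:
    src = ipaddress.ip_address(src_ip)
    return not any(src in net for net in BOGON_SOURCES)

print(permit_ingress("192.168.1.44"))  # spoofed private source: dropped
print(permit_ingress("198.51.100.7"))  # ordinary public source: forwarded
```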
From the standpoint of IoT, BCP38 and uRPF are best practices because they help contain certain types of attacks sourced from compromised IoT devices. The recommendations in BCP38 are not as widely deployed as they could be because of various operational and performance issues involved in deploying these techniques. For instance, the Mirai botnet used in the Krebs and Dyn DDoS attacks appeared to come from spoofed addresses—reportedly tens of millions of addresses from tens of thousands of sources. And yet—the attacks were successful.
Interestingly, IoT does not represent any new security challenges to network engineers. The issues of DDoS, address spoofing, ingress filtering, etc., are familiar to networking professionals. However, IoT makes these issues more poignant.
Distributed denial of service attacks have always been painful. However, the proliferation of poorly secured, easily exploitable, IoT devices has rendered DDoS attacks easier to execute and more harmful once launched. Thus, the industry has been forced to address the issue, reminding network engineers and equipment manufacturers of the mitigation strategies.
IoT also raises the stakes because IoT devices tend to gather data of potential interest to attackers. An attacker, for instance, might be very interested in being able to gain access to a building through an IoT device. What about IoT sensor data coming from a natural gas pipeline? While this chapter has focused on IoT devices being used as incubators for malware, network operators should remember IoT devices also represent targets of opportunity in and of themselves because of the data they sometimes have access to.
A different set of challenges for the network engineer is represented by IoT connectivity requirements. Typical network devices are powered predictably by the electrical grid, and are connected to a local wired Ethernet or wireless IP network. These sorts of network devices, which do represent a significant portion of the Internet of Things, are straightforward, as they are connected to the network in familiar ways.
However, many IoT devices are not able to be connected to the network in the typical fashion. They might be deployed over a wide geographic area, where enterprise-class WiFi networks do not reach.
Other IoT devices might be battery-powered, requiring an especially low power draw to remain functional for a long time without maintenance. Given these constraints of geography and battery power, what sort of networking technologies can be used?
Bluetooth Low Energy (BLE) has been billed by the Bluetooth Special Interest Group (SIG) as having been built for the Internet of Things.9 As a well-recognized industry standard, Bluetooth Classic and now BLE have indeed had a positive impact on IoT devices, particularly those in the consumer space such as wearables and smart home devices.
Bluetooth, including BLE, is a short-range protocol. Short-range means distances of approximately 100 meters or less. Therefore, Bluetooth is commonly found in home, auto, and personal area network applications such as wearables.
What does BLE do differently to reduce power consumption compared to Bluetooth Classic? The general answer is rather intuitive: BLE does less, and does “less” less often. This does not mean BLE is feature-poor or incapable. Rather, the word less as it is used here means BLE is designed to perform specific networking functions, eschewing others, all bound by the constraint of minimal power consumption.
Compared to Bluetooth Classic, BLE notably does less in the areas of data rates, throughput, and connection setup time:
• Data rates. Classic is specified for 1–3Mbps, while BLE rates are as low as 0.125Mbps and as high as 2Mbps.
• Throughput. Classic specification is for 0.7–2.1Mbps, while BLE is specified for a much lower 0.27Mbps.
• Connection setup time. Classic Bluetooth connection setup time is around 100ms, while BLE reduces this to 6ms.
Power consumption itself is not part of the official Bluetooth specification, but common experience suggests Bluetooth Classic hovers around a 1W power draw, while BLE ranges between 0.01 and 0.50W.
Part of the power savings comes from the duty cycle: how long must a device (in this case, the Bluetooth host chip) be powered before it has completed its task and can return to a sleep state? When comparing Bluetooth Classic to BLE, the BLE duty cycles are reduced in time and/or frequency, resulting in a greatly reduced power draw.
For example, a developer can set several duty cycle parameters affecting a BLE device in a connected state. Here are two:
• The interval between data exchanges, known as the “connection interval.” Higher intervals reduce power consumption; the tradeoff is that application performance may be reduced, because data can only be exchanged once each interval. Devices cannot send data if they are sleeping.
• Slave latency. In a Bluetooth pairing, one device is the master, and the other is the slave. Assuming no data to send, the slave can be configured to not check in with the master for a reasonable range of intervals, each skipped interval saving power.
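The effect of these duty-cycle parameters on battery life is simple arithmetic. The following sketch uses entirely hypothetical current-draw figures (they vary widely by chip) to show why stretching the connection interval extends battery life so dramatically.

```python
# Illustrative duty-cycle arithmetic; all current and capacity figures
# are hypothetical. The radio is awake for a short burst each
# connection interval and sleeps the rest of the time, so average
# current is a duty-cycle-weighted blend of active and sleep current.

def battery_life_days(conn_interval_ms: float, awake_ms: float,
                      active_ma: float, sleep_ma: float,
                      battery_mah: float) -> float:
    duty = awake_ms / conn_interval_ms                # fraction of time awake
    avg_ma = duty * active_ma + (1 - duty) * sleep_ma
    return battery_mah / avg_ma / 24                  # hours -> days

# 230mAh coin cell, 3ms awake per event at 8mA, 0.003mA asleep
print(battery_life_days(100.0, 3.0, 8.0, 0.003, 230.0))    # 100ms interval
print(battery_life_days(1000.0, 3.0, 8.0, 0.003, 230.0))   # 1s interval
```

Lengthening the connection interval from 100ms to one second cuts the duty cycle tenfold and pushes the estimated battery life from weeks into the better part of a year, which is exactly the coin-cell longevity described below.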
The result of BLE is the enablement of certain IoT devices to run weeks, months, or years on coin-sized batteries. However, BLE’s reduced duty cycles and resulting low power draw mean it is poorly suited for applications such as audio streaming. Therefore, BLE headphones are, at the time of this writing, not available. Why not?
Audio streaming demands frequent duty cycles, as there is always new audio information to be sent between master and slave. To make audio streaming work in a BLE context, new audio codecs might need to be devised to come up with a send/receive duty cycle amenable to fulfill the “low energy” part of BLE.
Other low-power, short-range protocols in use for IoT include Zigbee and Z-Wave.
Aside from the consumer space, the Internet of Things has seen uptake in the world of industry. Industrial applications often require networking services extending beyond the comfortable confines of a well-powered and well-connected building.
A municipality might use IoT to leverage smart city technology in areas such as fire hydrants, parking, street lighting, and waste management. Farming can use IoT to manage land and irrigation and to track animals. Utilities could use IoT to meter gas and water usage.
To handle the distance and low-power requirements of many of these scenarios, low-power, long-range communications protocols have been created. One such is LoRaWAN (Long Range Wide Area Network).10
LoRaWAN is a chirp spread spectrum, wireless communications protocol operating in unlicensed spectrum below 1GHz, creating a low-power wide area network (LPWAN). A LoRaWAN-based LPWAN offers data rates between roughly 0.3Kbps and 50Kbps over a range of 2Km to 15Km, depending on the environment. LoRaWAN is a secure protocol, offering both factory preprogrammed network authentication and over-the-air activation of participating nodes, as well as multiple layers of strong encryption.
A LoRaWAN network includes two major components:
• The end nodes with which to communicate. In this context, these are IoT sensors.
• The sensors communicate via LoRaWAN back to a gateway. The LoRaWAN gateway is a bridge between the LoRaWAN network and a traditional wireless or wired network used to process the sensor data. The gateway serves to decrypt inbound sensor data received via LoRaWAN radio and repackage the payload for transport across the traditional network.
The compromise LoRaWAN makes, allowing it to function as a low-power, long-range network protocol, is low bit rates. The data throughput across a LoRaWAN network is seemingly minuscule at a maximum of 50Kbps, especially when considering that data center Ethernet speeds of 100Gbps are commonplace.
However, LoRaWAN’s low bandwidth represents a design solving a specific networking challenge. For many applications, IoT sensors do not need to transmit or receive enormous amounts of data, but they do need to communicate over long distances using battery power. Thus, LoRaWAN is fit for purpose, providing a power-efficient means of transmitting small amounts of data over long distances.
LoRaWAN transmitters come in three classifications:
1. Class A devices are the most power efficient. Class A device radios sleep unless they have data to send. Once data has been sent, they remain awake for two receive windows, during which they can receive data. If more data needs to be sent to a Class A device than can be delivered during the two receive windows, the data must be queued until the next receive window opens.
2. Class B devices operate like Class A devices, except for the addition of scheduled receive windows. While Class A devices can only receive data after sending data, Class B devices can also receive data during regularly scheduled receive windows. The additional receive windows draw power to engage the LoRaWAN radio, and thus Class B devices are less power efficient than Class A devices.
3. Class C devices are different from Class A and B devices because Class C devices listen all the time, except when transmitting. Class C devices are appropriate for applications where the IoT sensor needs to receive data regularly from the central network, and the central network cannot wait for a remote device to open a receive window. The downside of listening constantly except when transmitting means the radio is constantly drawing power. Therefore, Class C devices are expected to use a grid-connected power supply, as batteries would be drained too rapidly in a Class C application.
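The Class A behavior, where downlink data must wait in a queue until the device's next uplink opens its receive windows, can be modeled in a few lines. This is a toy model for intuition, not an implementation of the LoRaWAN specification; the assumption that exactly one queued message fits per receive window is a simplification.

```python
from collections import deque

# Toy model of a LoRaWAN Class A endpoint: the network can deliver
# downlink data only in the two receive windows (RX1, RX2) that
# follow an uplink; anything more waits for the next uplink.
class ClassAEndpoint:
    def __init__(self):
        self.downlink_queue = deque()
        self.delivered = []

    def queue_downlink(self, payload):
        """Network side: queue data destined for the device."""
        self.downlink_queue.append(payload)

    def uplink(self, payload):
        """Device transmits, then opens two receive windows."""
        for _ in range(2):  # RX1 and RX2; one queued message per window here
            if self.downlink_queue:
                self.delivered.append(self.downlink_queue.popleft())
        return payload

node = ClassAEndpoint()
for msg in ("cfg-1", "cfg-2", "cfg-3"):
    node.queue_downlink(msg)
node.uplink("sensor-reading")
print(node.delivered)             # two messages delivered in RX1/RX2
print(list(node.downlink_queue))  # the third waits for the next uplink
```

The same model makes the class tradeoffs clear: Class B adds scheduled receive windows (more delivery opportunities, more power), and Class C listens continuously (no queuing delay, grid power required).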
Other low-power, long-range protocols in use for IoT include Sigfox and Neul.
IoT connectivity on IP networks faces an additional challenge—addressing. Network operators are comfortable with IPv4 addressing schemes, and might be tempted to use IPv4 in their IoT networks. While there is nothing wrong with this per se, IoT does present some interesting challenges that make IPv6 addressing attractive, including
1. Scale. IoT sensor networks have the potential to be vast, depending on the application. IPv6 makes the number of IoT sensors in a network a nonissue, as the address space is all but infinitely large. Starting with a well-planned IPv6 addressing scheme in an IoT network means never having to readdress the network.
2. The elimination of Network Address Translation (NAT). NAT is commonly used to translate blocks of private RFC1918 IPv4 address space into one or more publicly routable IPv4 addresses. Some networks regard this as a security feature, while others consider NAT a nuisance, as some applications require workarounds to function properly in the presence of NAT. NAT also makes two-way communication between devices difficult. IPv6 has no requirement for address conservation, i.e., hiding blocks of RFC1918 addresses behind a single public IP address. IPv6 could be used in an IoT network to offer two-way communication and improve endpoint identification and authentication.
3. Mobile IoT. If IoT devices are not stationary, it is possible they will move physically, as well as logically within the sensor network, disappearing and reappearing as they move between associations with various access points and gateways. IPv6 networks are well suited for mobile devices.
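The scale point in item 1 is easy to quantify: a single IPv6 /64 subnet contains 2^64 interface identifiers. The ten-million-sensor fleet below is a hypothetical deployment, used only to show how little of that space any real network consumes:

```python
# How far one IPv6 /64 goes for an IoT sensor network.
# The fleet size is an assumed example, not a figure from the text.

hosts_per_64 = 2 ** 64      # interface identifiers in a single /64 subnet
sensor_fleet = 10_000_000   # a hypothetical ten-million-sensor deployment

print(hosts_per_64)                    # addresses available in one subnet
print(hosts_per_64 // sensor_fleet)    # spare capacity per sensor: ~1.8 trillion
```

This is why a well-planned IPv6 scheme means never readdressing: the address space, not the sensor count, is effectively unbounded.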
Another question network engineers must consider about IP addressing more broadly is whether it makes sense to assign IP addresses to IoT devices. In one sense, this is a strange question to ask. “Well, of course, there needs to be an IP address on IoT devices! There needs to be an IP address on everything.”
The “IP or not” question is more nuanced, however. In traditional networks, networking professionals do not need to consider whether an IP header introduces inefficiency. Hardware ASICs are optimized to process these headers, and bandwidth is plentiful. In addition, IP headers are used to identify source and destination addresses, carry interesting information about the flow they are a part of, and make forwarding decisions. Network engineers used to fast Ethernet and wireless networks and devices ubiquitously connected to the Internet might have a hard time imagining networks without IP.
For IoT devices connected to traditional networks and to networks with no bandwidth concerns, IP addressing does not introduce any new issues. However, much of the IoT domain leaves traditional networks behind.
Consider low-power WANs (LPWANs) like LoRaWAN. When an IoT sensor is sending its data to a receiver, what is the most important part? The payload—the data itself. Any framing or encapsulation is overhead, even though it is necessary to deliver the data. Therefore, in an LPWAN, the key is to reduce the overhead by minimizing the number of bits encoded and sent over the air.
You might recall LoRaWAN’s highest bandwidth is around 50Kbps. Suddenly, the size of an IP header becomes an interesting question. The size of the IP header is why there is no such thing as IP over LoRaWAN. LoRaWAN packets go to a LoRaWAN gateway, where the payload can be repacked into IP datagrams and sent to points beyond. Why? IP packets with their pesky headers are simply too inefficient to send over the LoRaWAN network. Figure 29-3 illustrates.
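The arithmetic behind this is straightforward. Using the fixed 40-byte IPv6 header and 8-byte UDP header against a small sensor payload (the 12-byte payload is an assumed figure):

```python
# Why IP over LoRaWAN is impractical: header overhead vs. a small payload.
# Header sizes are the standard minimums; the payload size is assumed.

ipv6_header = 40   # bytes, fixed IPv6 header
udp_header = 8     # bytes
payload = 12       # bytes, e.g., one sensor reading (assumed)

total = ipv6_header + udp_header + payload
overhead_pct = 100 * (ipv6_header + udp_header) / total

link_bps = 50_000  # LoRaWAN's approximate best-case bandwidth
airtime_ms = total * 8 / link_bps * 1000

print(f"{overhead_pct:.0f}% of the frame is header")  # -> 80%
print(f"{airtime_ms:.1f} ms on air at 50 Kbps")       # -> 9.6 ms
```

Four out of every five bits on the air would be header rather than data, which is exactly the inefficiency the gateway repackaging avoids.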
LoRaWAN’s lack of IP does not imply it is impossible to use IP addressing for IoT. Rather, it means every networking tool is designed for a specific purpose. For example, in the IETF’s RFC4944, 6LoWPAN is specified to integrate IPv6 with IEEE 802.15.4 mesh networks, including compression of the IPv6 header wherever possible. Reading through RFC4944 exposes the many technical challenges of this integration.11 Yes, the technology exists, and while difficult to implement practically, it can be done.
The question then comes back to the network engineer. Given a choice of connectivity technologies and addressing schemes for IoT, which one is the right one? The answer depends on the connectivity requirements. Different requirements will lead to different answers.
Some IoT sensors produce a significant amount of data. What happens when the data must be acted upon in real time? How should the data be processed?
The idea of fog computing—also termed edge computing—is that certain IoT sensors stream too much data to be processed far away, such as in the public cloud. In these cases, IoT data must be analyzed locally to be of real-time value in many applications, particularly industrial ones. Sending the data far away, processing it, and bringing back the results would take too long.
Besides tending toward high latency, bandwidth to the public cloud is simply not cheap enough to size pipes sufficiently large for IoT applications. Thus, the term fog is meant to conjure an image of a cloud close by, rather than one far away. Sending data into the local fog allows for speedy analysis and timely results.
The fog computing model is to process IoT data as close to the sensors as possible. Many IoT devices do not have local data centers in which to perform data processing. Those with a data processing center closer than the public cloud might find connectivity is often fragile. Therefore, fog computing sometimes looks like a small, dedicated device, piggybacked on the sensor, that accepts data and performs processing. This could also mean data processing software resident on IoT network gateway devices.
Use cases for fog computing include many examples from Industrial IoT (IIoT):
1. Locomotive fuel efficiency. Engine sensor data is coupled with GPS data to reduce engine idle time, saving significantly on fuel. Even at idle, locomotives utilize a large amount of fuel. Avoiding excessive fuel burn requires real-time data analysis as the locomotive moves across the landscape.
2. Cavitation alerts. Temperature, input pressure, output pressure, and water velocity are monitored in real time to detect the conditions in which an air bubble might be introduced into a water moving system. These air bubbles, or cavitations, can destroy water pumps.
3. Wind energy forecasting. Wind turbine data is analyzed to predict power yield for the next 24 hours, a legal requirement in certain parts of the world where power grids are carefully managed by governments.
4. Factory yield optimization. Sensor data is analyzed to discover manufacturing problems resulting in a bad run of products, improving overall quality and reducing factory downtime.
While fog computing is a data processing paradigm more than a networking paradigm, the computing requirements of IoT make an implicit demand on network engineers. Where there is a great deal of data, there must be a capable network to move the data. Therefore, understanding the load created by IoT sensor data will inform the IoT network design.
The Internet of Things is an interesting area of new deployment and research that is ultimately likely to reshape the way the Internet is seen. Instead of providing connectivity for people searching for websites and social connections, the primary job of the Internet, in terms of traffic flow, will be to connect sensors to cloud-based services. These cloud-based services will, in turn, peek into every area of life, raising security and privacy concerns that need a lot of thought to untangle. This chapter has provided some ideas of how such an Internet might work and how it might be secured.
The next chapter will consider another future-looking topic, the future of network engineering.
Baker, Fred, and Pekka Savola. Ingress Filtering for Multihomed Networks. Request for Comments 3704. RFC Editor, 2004. doi:10.17487/RFC3704.
Banks, Ethan. “Foghorn: Real-Time Decision Making for IIoT.” Packet Pushers, September 14, 2016. http://packetpushers.net/foghorn-iiot/.
“The Bogon Reference Page.” Team CYMRU. Accessed July 17, 2017. https://www.team-cymru.org/bogon-reference.html.
Cantrill, Bryan. “Unikernels Are Unfit for Production.” Blog. Joyent, January 22, 2016. https://www.joyent.com/blog/unikernels-are-unfit-for-production.
Haberman, Brian, and Robert M. Hinden. Unique Local IPv6 Unicast Addresses. Request for Comments 4193. RFC Editor, 2005. doi:10.17487/RFC4193.
Hilton, Scott. “Dyn Analysis Summary of Friday October 21 Attack | Dyn Blog.” Corporate. Dyn, October 26, 2016. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/.
Huston, Geoff. “The Internet of Stupid Things.” APNIC Blog, April 30, 2015. https://blog.apnic.net/2015/04/30/the-internet-of-stupid-things/.
Krebs, Brian. “KrebsOnSecurity Hit with Record DDoS.” Blog. Krebs on Security, September 16, 2016. https://krebsonsecurity.com/2016/09/krebsonsecurity-hit-with-record-ddos/.
“LoRa Alliance Technology.” Standards Body. Lora-Alliance. Accessed July 17, 2017. https://www.lora-alliance.org/technology.
Madhavapeddy, Anil, and David J. Scott. “Unikernels: Rise of the Virtual Library Operating System.” Queue 11, no. 11 (December 2013): 30:30–30:44. doi:10.1145/2557963.2566628.
Montenegro, Gabriel, Jonathan Hui, David Culler, and Nandakishore Kushalnagar. Transmission of IPv6 Packets over IEEE 802.15.4 Networks. Request for Comments 4944. RFC Editor, 2007. doi:10.17487/RFC4944.
Moskowitz, Robert G., Daniel Karrenberg, Yakov Rekhter, Eliot Lear, and Geert Jan de Groot. Address Allocation for Private Internets. Request for Comments 1918. RFC Editor, 1996. doi:10.17487/RFC1918.
Neville-Neil, George. “IoT: The Internet of Terror.” Queue 15, no. 3 (June 2017): 10:19–10:24. doi:10.1145/3121437.3121440.
Senie, Daniel, and Paul Ferguson. Network Ingress Filtering: Defeating Denial of Service Attacks Which Employ IP Source Address Spoofing. Request for Comments 2827. RFC Editor, 2000. doi:10.17487/RFC2827.
“SIG Introduces Bluetooth Low Energy Wireless Technology, the Next Generation of Bluetooth Wireless Technology.” Society. Bluetooth. Accessed July 17, 2017. https://www.bluetooth.com/news/pressreleases/2009/12/17/sig-introduces-bluetooth-low-energy-wireless-technologythe-next-generation-of-bluetooth-wireless-technology.
1. Explain why IoT has garnered so much attention from security practitioners.
2. Explain the difference between isolation using an appliance or service and endpoint isolation.
3. What does the term duty cycle have to do with power conservation in IoT devices?
4. Give some examples of how Bluetooth Low Energy devices reduce power consumption when compared to Bluetooth Classic devices.
5. In IoT devices using LoRaWAN for radio communications, explain why a Class A device can run on battery power, while Class C devices should have dedicated power supplies.
6. Why does IP addressing introduce a technical challenge for LPWAN communications protocols?
7. In one sentence, explain why edge (fog) computing is useful.
8. Research BCP38, and explain why it is not widely deployed.
9. The text considers IoT in the context of DDoS attacks; research the impact of IoT on large-scale control systems (such as the power grid), and explain the risks involved.
1. Krebs, “KrebsOnSecurity Hit with Record DDoS.”
2. Neville-Neil, “IoT.”
3. Huston, “The Internet of Stupid Things.”
4. Moskowitz et al., Address Allocation for Private Internets.
5. Haberman and Hinden, Unique Local IPv6 Unicast Addresses.
6. Senie and Ferguson, Network Ingress Filtering: Defeating Denial of Service Attacks Which Employ IP Source Address Spoofing.
7. Baker and Savola, Ingress Filtering for Multihomed Networks.
8. “The Bogon Reference Page.”
9. “SIG Introduces Bluetooth Low Energy Wireless Technology, the Next Generation of Bluetooth Wireless Technology.”
10. “LoRa Alliance Technology.”
11. Montenegro et al., Transmission of IPv6 Packets over IEEE 802.15.4 Networks.
Just about every culture in the world has some saying similar to
Those who forget the past are doomed to repeat it.
A variant of this is RFC1925, rule 11, which states
Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.1
This book began with a simple idea: you can use this to your advantage. By learning what is old, you can learn what will be proposed as new in the future. This mind-set of looking to the past to understand the future can be codified in the process:
• What is the problem being solved?
• What range of possible solutions have been proposed to solve this problem?
• How have these solutions been implemented in the past?
Perhaps two more thoughts are in order, as well:
• What are the tradeoffs involved in solving the problem this way?
• How does this solution interact with other problems and their solutions in a larger system?
These rules, however, only give you a dim view of the future; they provide the “guard rails” of what might be developed, and a framework within which to understand and apply these developments.
What of the larger market? Will the skills and mindset so carefully laid out in the previous chapters and pages be useful in five years’ time? Or twenty? Predicting the future, as they say, is hard because it changes so much. It is particularly hard in the case of network engineering, which likely has more than one future at any one time.
This chapter is going to take a different direction from the previous chapters. Each section will describe a different movement in network engineering and where this movement might lead in the future. Some of these trends will overlap, or depend on one another to some degree; others will be completely independent of the others. Remember these forward-looking snippets are spun from current trends, so any particular set of ideas will likely be changed radically by the time they come to pass, or perhaps they will be found impractical, and not come to pass at all. A more likely future is that all of these futures become real in some networks.
It is difficult to remember, when working on a single network, in a small corner of the network engineering world, how large the network engineering world is. While network engineering is small in comparison to many other subcultures of the larger engineering world, and tiny in terms of the larger world, it is still a large world, with many different subsets. There will always be businesses that take on the future by thinking differently. Some will succeed, many will fail, but all of them will have a different vision of what information processing needs to look like, and hence how to build a network to get done the work they need done.
The programmatic configuration of network devices is already widely used in many networks; you can be confident this trend will continue and accelerate in the future. The age of the command-line interface (CLI) is largely over; programmatic interfaces will take the place of the CLI.
What has stood in the way of pervasive network automation in a multivendor network is the lack of a standardized Application Programming Interface (API). In a multivendor network, the API used to configure and manage each device will vary from vendor to vendor. Platform capabilities also vary within and between vendors. Thus, there are differences both in what can be done and in how it is done, preventing rapid, industrywide adoption of automation, as tooling must be written to support multiple vendors with their sundry interface nuances.
The beginning of a solution in this space is rethinking the way network devices and protocols are modeled. What has traditionally been done—in fact, what the CLI does—is to focus on the information to be carried. Much like a fixed length packet encoding (see Chapter 2, “Data Transport Problems and Solutions”), the model is embedded in the CLI model. The metadata, or information about what is being configured, is carried in the configuration manuals or CLI help system.
An alternative to this is to focus on the modeling language first. In this solution, a modeling language is designed to act more like a Type Length Value (TLV) system; information about the information is provided separately from the information itself. This allows implementations to work around changes in the way data is represented, even ignoring information they do not understand how to process explicitly.
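A minimal sketch of the TLV idea, with invented type codes, shows why carrying the metadata on the wire lets a parser tolerate types it does not understand:

```python
import struct

# Minimal Type-Length-Value sketch: the "information about the information"
# (type and length) travels with each value, so a parser can skip types it
# does not recognize. The type codes here are invented for illustration.

def tlv_encode(fields):
    """fields: iterable of (type_code, bytes_value) pairs."""
    out = b""
    for t, v in fields:
        out += struct.pack("!BB", t, len(v)) + v
    return out

def tlv_decode(data, known_types):
    """Return {type: value} for known types; silently skip unknown ones."""
    result, i = {}, 0
    while i < len(data):
        t, length = struct.unpack_from("!BB", data, i)
        i += 2
        if t in known_types:
            result[t] = data[i:i + length]
        i += length  # an unknown type is stepped over, not fatal
    return result

wire = tlv_encode([(1, b"eth0"), (2, b"\x01"), (99, b"future-ext")])
print(tlv_decode(wire, known_types={1, 2}))
# {1: b'eth0', 2: b'\x01'} -- type 99 is ignored without breaking the parse
```

Contrast this with a fixed-length encoding, where a receiver that does not know the layout of a new field cannot parse anything that follows it.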
One such modeling language is YANG, a standard shepherded and managed by the Internet Engineering Task Force (IETF). Models can be built describing an interaction with a protocol or process, rather than a specific implementation, using the YANG modeling language. Rather than writing automation processes that expect a specific API or network device chipset, the idea is to automate against a model. The automation process will then work with all devices conforming to the models in use. The model functions as an abstraction layer.
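As a sketch of what automating against a model rather than a CLI looks like, the following renders one model-shaped description into two invented vendor syntaxes. The model layout loosely imitates an OpenConfig-style interface tree; the vendor syntaxes and names are assumptions for illustration, not taken from any published model or product:

```python
# Model-driven configuration sketch: the automation tool reasons only about
# the model-shaped data; vendor differences live in pluggable renderers.
# Both vendor syntaxes below are invented for illustration.

desired = {
    "interfaces": [
        {"name": "eth0", "description": "uplink", "enabled": True},
    ]
}

def render_vendor_a(model):
    """Render the model into a hypothetical block-style CLI syntax."""
    lines = []
    for intf in model["interfaces"]:
        lines.append(f"interface {intf['name']}")
        lines.append(f" description {intf['description']}")
        lines.append(" no shutdown" if intf["enabled"] else " shutdown")
    return "\n".join(lines)

def render_vendor_b(model):
    """Render the same model into a hypothetical set-style syntax."""
    lines = []
    for intf in model["interfaces"]:
        state = "enable" if intf["enabled"] else "disable"
        lines.append(f"set interfaces {intf['name']} {state} "
                     f"description \"{intf['description']}\"")
    return "\n".join(lines)

for render in (render_vendor_a, render_vendor_b):
    print(render(desired))
```

The point is the abstraction layer: the `desired` structure never changes when a vendor is swapped out, which is exactly what standardized YANG models promise at scale.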
One consortium of network operators creating such models is called OpenConfig. OpenConfig participants include Google, AT&T, Facebook, Netflix, CloudFlare, and Microsoft, among several other major service providers and large network operators.
OpenConfig has contributed many network models to the community, covering a diverse set of network elements, including policy, interfaces, lower- and higher-level transport protocols, and control planes. The OpenConfig group has also worked with the Internet Engineering Task Force (IETF) on these models, to help drive the industry toward a standardized way of representing the network. The IETF has taken the modeling work very seriously, attempting to bring together a complete and interoperable set of unified models.
As a modeling language, YANG is not especially new. Many IETF RFCs have been released defining YANG or auxiliary interfaces related to YANG. Here are two key RFCs:
• In October 2010, the 173-page RFC6020, YANG—A Data Modeling Language for the Network Configuration Protocol (NETCONF), was published.
• In August 2016, RFC7950 weighed in at 217 pages, titled The YANG 1.1 Data Modeling Language. Even with the YANG 1.1 specification so recently published, there are rumblings within the IETF about extensions to 1.1 being added or possibly even a YANG version 1.2.
As of this writing, over 220 models are working their way through the IETF ratification process. In fact, YANG modeling has become so pervasive that the IETF has created a functional role of “YANG doctor” whose job it is to validate proposed YANG models.
YANG is meant to be human-readable, in contrast with the eXtensible Markup Language (XML), which tends to be read more easily by machines than people. YANG models are published as modules, where a module contains all the objects required to define some specific networking feature. Modules can reference other modules by importing external modules or using includes of submodules.
The structure of a YANG model is a tree with node objects, conforming to a specific hierarchy:
• A module fits into a namespace, described with a Uniform Resource Locator (URL).
• A prefix describes how a module is referenced inside the module or by other modules. Think of a YANG prefix as a shorthand description of a YANG module.
• There are at least four node types in YANG. A leaf object contains a value logically located at the end of a tree branch. Leaf-lists are sequences of leaf objects. Lists are collections of many sorts of objects, including lists and leaf-lists. Containers can hold lists, leaf-lists, leaves, and other containers. These all serve to organize elements in the YANG model.
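A hypothetical module illustrating these constructs might look like the following. The module name, namespace, and leaves are invented for illustration and are not drawn from any IETF or OpenConfig model:

```yang
// A hypothetical YANG module showing the node types described above.
// All names here are invented for illustration.
module example-interfaces {
  namespace "urn:example:interfaces";  // the module's namespace
  prefix exif;                         // shorthand used to reference it

  container interfaces {               // container: holds other nodes
    list interface {                   // list: one entry per interface
      key "name";
      leaf name {                      // leaf: a single value
        type string;
      }
      leaf enabled {
        type boolean;
      }
      leaf-list search-domains {       // leaf-list: a sequence of leaves
        type string;
      }
    }
  }
}
```

Reading from the outside in, the container organizes the tree, the list holds one entry per interface keyed by `name`, and the leaves and leaf-list sit at the ends of the branches.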
The problem with YANG is not with the modeling language itself; YANG is well understood and in use by standards development organizations as well as consortiums such as OpenConfig. Despite this demonstrated level of industry enthusiasm, networking equipment vendors have been slow to include YANG models in their products.
Vendors are often slow to support YANG because vendors need to differentiate to sell a product. YANG models offer a baseline of networking functionality, or a lowest common denominator, so in some sense, configuring everything through a standard set of models described in YANG would “level the playing field.”
Thus, vendors have not been overly enthusiastic with their YANG support, unless compelled financially by large, persistent customers. The OpenConfig project is one industry attempt to bring operators together to combine their buying power around specific requests to support YANG.
Standardized network modeling is a key to enabling pervasive network automation. Once configuring a network device is a predictable exercise, then creating automation tooling becomes a simpler task. Today’s network automation tooling is burdened by a plethora of interfaces, methods, and output that must be normalized for automation processes to work in an expected way across a multivendor network. The broad industry adoption of standardized YANG models would change this aspect of network engineering. Automation without something like YANG will continue to move forward, but not as quickly or efficiently as it could with a single modeling language used by every vendor and operator.
Chapter 25, “Disaggregation, Hyperconvergence, and the Changing Network,” describes the rise of hyperconverged compute and storage. Because the networking market often follows the compute and storage market in a broad sense, it is worth putting some thought into what network hyperconvergence might look like. What were the components of the hyperconverged system at the edge?
First, there is white box; the networking world is already moving in this direction. While network devices such as firewalls, routers, and switches were once purchased in an “appliance” model, many parts of the networking world are quickly moving toward a disaggregated model, where the hardware and software are purchased as separate “things.” This enables the concept of white box—although the box might not be white. The terms bright box and gray box attempt to capture buying boxes from brand-named vendors, but rather than buying them for their software capabilities, you can now buy them for their hardware capabilities.
Second, there is scale out. The move from traditional hierarchical network designs, particularly in the data center, and toward a flatter spine and leaf design is the equivalent scale-out solution in the networking space. Rather than buying a chassis and adding cards as needed, you buy a set of single rack unit boxes and build a network that can be increased (or decreased!) in scope and scale by wiring more boxes in.
Third, there is pooling. Here several different trends in the networking world are working together to create the beginnings of a true pooling capability: the rise of dynamic overlay networks, software-defined networks, and network function virtualization.
To combine these three, consider the spine and leaf network built out of white box devices, with a dynamically created overlay network providing virtual sets of resources as needed. This kind of network can be
• Scaled in resources by adding more boxes to the spine and leaf underlay, as well as adding more network-based services to virtual machines connected to this underlying fabric
• Pooled by building virtual networks in an overlay to consume the services of any number of underlay devices as needed
One important question is the depth of the overlay required to build such a system; most of today’s overlay solutions are very heavyweight, full-scale tunneling and based on either a “second control plane,” or a centralized control plane (rather than a more flexible hybrid distributed + centralized control plane). What will eventually be needed in this space is a lighter-weight set of control planes and overlay system that will work with underlying hardware better—perhaps not even an “overlay” at all, but rather a set of services that can send isolated traffic through the network without the work of building an actual virtual topology. Segment routing may provide a path to such lightweight overlay solutions.
While there are commercial solutions in this space, and custom solutions built and operated by large-scale cloud providers, this is still a nascent market. The solutions available today, either based on vendor-specific hardware and software and focused on the Top of Rack (ToR) switch in the data center fabric, or on the hypervisor in the server, are generally hampered by a lack of communication between the network resources (the network processors sitting on the ToR switches) and the overlay switching requirements. Further, these solutions are hampered by the amount of configuration required to simply get the system going, particularly in the underlay space.
But these markets are growing and changing; VMware, Cumulus, and others are working on solutions that will, over time, likely develop into such a hyperconverged solution. There will always be, of course, an appliance-based model; there will always be software and hardware purchased as a single system.
But the disaggregation and programmable network movements are paving the way for a new kind of network, more along the lines of hyperconverged compute, storage, and network access resources.
Many hyperconverged networks are likely to be vendor specific; only a particular vendor’s gear will work with a specific hyperconverged solution. The beginnings of this kind of hyperconvergence, combined with vendor proprietary APIs for automation, are already apparent in the product lines of many vendors.
According to the manufacturers and pundits, intent-based management is the future of network engineering. There certainly seem to be a lot of good reasons to embrace the intent-based wave.
For instance, networks are certainly hard to configure, maintain, and troubleshoot today. The 2 a.m. rule is almost always violated today simply because the networks needed to support the applications that businesses choose to run drive a lot of complexity into the design and operation of the network. Operations personnel are left trying to reverse-engineer this configuration on that device at 2 a.m., trying to tease out every application that might be impacted if any of the various pieces are modified to solve a problem right now.
A lot of the apparent problem is in translating the business intent into designs, which then must be translated into configurations, which then must be translated into the combined configurations of hundreds of different intent chains spread out over many years of network operation, vendor changes, and the personal preferences, strengths, and weaknesses of individual network engineers.
It would certainly, it seems, be a lot simpler to just state your intentions and let the network translate those intentions directly into configurations. The amount of money you could save on hiring all those engineers who are doing the translation work manually would probably be enough to justify the change all on its own. An artificially intelligent process running on some virtual machine (perhaps in the vendor’s cloud) can adjust your network settings based on your stated intent, the applications you are running, and experience with other customers, and produce an optimal network configuration for every business, all the time.
But when so many people are saying the same thing at the same time, particularly in the normally contrary world of network engineering, it is time to take a step back and consider where the tradeoffs might be in this rush to intent. If you have not found the tradeoffs, you have not looked hard enough.
What are the tradeoffs involved in intent-based networking?
A good place to begin is with the engineer sitting at home, working from a laptop, at 2 a.m., trying to resolve a network problem (or at least figure out whether the problem is the network or some other part of the system). Perhaps intent-based systems will be better documented than the engineer-configured systems today, but this does not seem likely. If an AI is involved, there is very little chance there will be any documentation, in fact, as no one really understands what decision an AI might make or why. Even ignoring the problem of whether or not an AI will ever be able to do the job at hand—monitor every element of every application in a network and every element of every network device, combine this information with the capabilities of each installed device, and make fine-grained adjustments in every area to provide optimal utilization and application support for every possible network and business requirement—it is difficult to see how a particular decision can ever be reverse-engineered to determine whether the network is running properly or not.
Another hard problem to solve here is: whose intent? Someone, somewhere, must determine which factors make a difference in determining intent, and what should be done in response to an expression of intent. While AI systems might be able to handle some of this around the edges, humans will always need to at least train the AI on what action to take, or on the intent behind any new feature in networking gear. Moving intent into the controller moves the interpretation of intent into configuration away from local engineers, who are (arguably, at least) accountable to your business goals, and toward a vendor's cloud or intent server.
The next question to ask is: what does this intent look like? Is it something like “give the president’s email priority over the receptionist’s?” Or is it finer grained? If it is finer grained, then someone must interpret the business problem into some form of “intent language” (an intent YANG model, anyone?), which means understanding the system and its reaction to any sort of intent statement made to the system. If the intent is to stop hiring engineers, this is not the path to get there. What would be needed instead to save money on engineering staff is more like the model where the administrator says, “prioritize the president’s email”—but then a host of new problems arise.
Given the system has some sort of interface, will the interface be standardized or vendor specific? The more likely answer is vendor specific, because any “intent language” must be rich enough to be useful and allow the vendors to differentiate themselves in selling into an end-to-end business model. Assuming the goal is for applications to drive the intent interface, as well as humans, each application must now be able to talk to each vendor interface in some way. The single vendor tie-in quickly moves from the networking hardware and software into the entire ecosystem of applications.
Above all of these questions is a larger, systemic one lurking in the background: intent-based interfaces are ultimately a form of abstraction. While abstractions are very useful—in fact, engineers could not live without them—they also have side effects that are not realized until far into the abstraction process. First, all abstractions remove information, and all information removal reduces efficiency in some way (the optimal use of resources, time, etc.). Second, all nontrivial abstractions leak: things not visible outside the system are always somehow passed through to the next level up, but in a way that is difficult to understand and manage.
None of this is to say intent-based networking is impossible, nor that it will not have good uses. Intent-based interfaces will probably be useful in a narrow range of applications, perhaps broadly deployed in large-scale networks. Intent-based interfaces will probably also be useful in smaller-scale networks, or in specific kinds of topologies, where the business is so far removed from information technology that the attendant inefficiencies and complexities just never become a concern.
Whether or not pervasive open automation ever becomes a reality, applying machine learning to network management is an area of active research. Artificial narrow intelligence (ANI) overlaps with the goals of machine learning (though not the techniques) enough that many engineers will see the two as the same thing. To provide a more formal definition:
• Data mining is the process of discovering previously unknown patterns in large information sets.
• Machine learning is the process of optimizing a set of input variables to reach a specified set of goals.
• Artificial narrow intelligence is the combination of several different data mining and machine-learning subsystems into a larger system that approaches a natural (or even human) level of ability at some specific task.
Figure 30-1 illustrates putting these three things together in a network engineering context.
Figure 30-1 Data analytics, machine learning, and artificial narrow intelligence in the context of network engineering
Figure 30-1 illustrates how you might use data mining to discover things about your network that are not otherwise obvious; this information might drive a machine-learning system that consumes a specific final network state, combines the mined information with known state information, and adjusts various network inputs in order to reach the specified state. If enough of these kinds of systems are merged to form a “natural-like” system for managing some part of network operations, this might (or might not) be considered ANI.
There are major hurdles to overcome in order to apply machine learning to network management, however. Specifically, data analytics relies on being able to process a consistent set of information over long periods of time, in order to find patterns, and patterns in the changes. Networks probably have too high a rate of change, and too much noise, for data analytics to be as effective in the network engineering world as it is in other areas. While some basic things might be learnable through data analytics, such as spotting interesting or unusual flows of information, it may be difficult to use machine learning to discover deeper patterns, as the pattern in a network as a system might just be "there is always change."
Machine learning is often narrowly focused in the same way as data analytics. Machine learning largely relies on consistent connections between inputs and outputs, no matter how many there are, to determine how to adjust the inputs to reach a certain output. There may not be enough consistency in networks as systems to allow this kind of fine-grained adjustment to be discovered through a machine-learning process, particularly given the constant rate of change that could plague the data analytics systems that machine learning would likely rely on.
Finally, machine-learning systems must be taught, or they must learn, based on an existing data set. As each network is essentially built to solve a single problem set, each network can effectively be treated as a unique machine-learning problem to solve. This could seriously hamper the ability of machine-learning systems to effectively “solve” network management issues.
These problems are a result of a basic problem in network engineering highlighted throughout this book: there is no “one right way” to build a network, a transport system, or even a protocol within a system. There is no “general theory of networks” you can rely on when building a machine-learning system to manage networks. An extended quote from someone working in this area as of this writing is useful in putting these problems into perspective:
Even though networking has “just massively more compute and massively more data” available, it’s not yet clear how machine learning can be applied there, Meyer says. What’s missing, he believes, is a theory of networking. A rich body of academic work backs the networks we use today, certainly, but there is no unifying theory defining how a network, in an abstract sense, ought to behave, or how it ought to be structured. The networks that form the Internet certainly share some core principles, but they weren’t built from a central theory. They emerged through trial-and-error, “some good ideas and people telling each other how to do it,” Meyer says.2
Like intent-based networks, machine learning and ANI may play a narrow role in network engineering over time, but it seems unlikely you will see semiautonomous networks driven from an ANI anytime soon.
Named Data Networking (NDN), which is loosely related to Content Centric Networking (CCN), relies on a simple trio of observations. First, the Internet Protocol stack of protocols, like every other networking system, is built on a narrow waist. The narrow waist is, in this case, the Internet Protocol (IP), as illustrated in Figure 30-2.
All complex systems are built with some sort of thin waist in this way; protocol and network design patterns count on these thin waist points (or choke points) to control complexity by hiding information (or the abstraction of state).
The second observation is that the Internet and most networks are primarily designed to distribute information—particularly information marshaled and described through metadata. The third observation is that IP is not very good at carrying information, but rather is designed to carry bits.
Combining these observations, NDN asks: why should the thin waist of the Internet be a protocol that does not specialize in what the Internet does? Or rather, why not replace the thin waist of the Internet with a protocol designed to efficiently distribute data? Once you begin to look at the Internet, or any network, as a large distributed database, the problems to be solved become radically different than the transport and reachability problems considered in this book. Figure 30-3 illustrates the concept.
Assume you are looking for the song Pleasant Valley Sunday by the Monkees. Begin with standard IP, walking through the steps to retrieve this information at A from F:
1. A user clicks on a search result for Pleasant Valley Sunday, which indicates a copy of the song can be found at http://songserver.com/monkees/pleasant.
2. The host operating system looks up .com, then songserver.com, retrieving an IP address from the Domain Name Service (DNS).
3. The host operating system begins a session with the IP address, ultimately starting a session with F.
4. The host operating system then performs any necessary authentication steps, such as putting a sign-in form on-screen, or trading some certificate—even perhaps undertaking some financial transaction to purchase a copy of the song.
5. The host operating system at A now downloads a copy of the song.
At every step in this process, the host builds a point-to-point link with some other system, such as a DNS server and the server on which the copy of the song resides. The routers along the path of this traffic just switch the packets; they do not cache any information, nor can they participate in the financial transactions or the authentication of the user. Compare the process using an NDN:
1. A user clicks on a search result for Pleasant Valley Sunday, which indicates a copy of the song may be obtained from /com/songserver/monkees/pleasant; note the difference in the ordering of the location of the data.
2. The host operating system sends this request to its upstream router, B, which examines the name of the object requested; it finds a path to a server claiming to have this information that is reachable via C, so it sends the request to C.
3. C again consults the name of the object requested and finds it has a path through E, so it forwards the request to E.
4. E consults the name of the object requested and finds it has a path through F, so it forwards the request to F.
5. F consults its local information store and finds it has a copy of this object in the location specified; it returns an encrypted copy of the object to E.
6. E stores a local copy of the encrypted object, examines the path through which the request for this object came, and sends a copy of the object to C.
7. C, likewise, stores a local copy of the encrypted object, examines the path through which the request for the object came, and sends a copy of the object to B.
8. B stores a copy of the encrypted object and sends it to A.
9. A, on receiving the object, now must find some way to unencrypt the encrypted object; to do this, it either contacts a third party to arrange a financial transaction or uses local information it has already stored to unencrypt the object.
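The name-based forwarding in steps 2 through 4 can be sketched as a longest-prefix match over name components rather than over IP addresses. The forwarding entries and next-hop labels below are invented for illustration; a real NDN forwarding information base is considerably more involved.

```python
# A toy name-based FIB: each entry maps a name prefix (a tuple of
# components) to a next hop, mirroring the topology in the steps above.
fib = {
    ("com",): "C",
    ("com", "songserver"): "C",
    ("com", "songserver", "monkees"): "E",
}

def next_hop(name):
    """Longest-prefix match on the components of a name like
    '/com/songserver/monkees/pleasant'."""
    components = tuple(name.strip("/").split("/"))
    # Try progressively shorter prefixes until one matches.
    for length in range(len(components), 0, -1):
        hop = fib.get(components[:length])
        if hop is not None:
            return hop
    return None

print(next_hop("/com/songserver/monkees/pleasant"))  # E
print(next_hop("/com/songserver/other"))             # C
```

The request for the song matches the three-component prefix and is forwarded toward E, exactly as in step 3; a name with no matching prefix has nowhere to go.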
The NDN version seems far more complex at first blush, but it does have several advantages. For instance:
• Rather than encrypting or hardening the session between the client (A) and the server (F), the object itself is encrypted; this means there is per object protection throughout the entire network. It does not matter if the memory of any particular device is compromised, because every object is encrypted as it is carried over the network.
• The metadata about the object is (or can be) exposed, allowing each device to handle the data according to local policy, including “this user paid more for higher-speed service,” etc.
• The entire network acts as a distributed database; if a second user requests this same information, the request is routed toward where the local routing tables indicate the information can be found, as with an IP packet. However, if the information is encountered before the originating server is reached, the information can be returned. As all the objects are encrypted, there is little danger in returning the information as requested; the requestor must figure out how to unencrypt and use the data. Further, the encryption scheme can include some form of time and date stamp, so out-of-date information is discarded once a new version is available.
• Since information is being passed around, rather than packets, and each object is encrypted, the source and destination of the objects is pretty much meaningless (except in the case of a specific request and reply series).
• Since the source of the information is no longer really relevant in routing terms, this could place smaller information sources on an equal footing with larger ones.
There are, of course, many challenges to overcome in this kind of system as well. For instance:
• Network forwarding devices are not, today, designed to store and forward information in this way. Building systems able to store and forward information in this way would place a major burden on large-scale providers, who would need to rebuild their networks, and think about how to charge based on the amount of data any particular user has requested, resulting in intermediate storage in their network. This could reshape the entire economy of the Internet by making it cheaper to always pull information from the network everyone else already wants. For instance, if you asked for a particular version of Pleasant Valley Sunday, the network might suggest another version, or even another song, which is already locally available, increasing the efficiency of the storage in the network. This process could squelch out less popular content in much the same way as the largely centralized content providers do today.
• It seems hard to understand how streaming services might work in this kind of network. Perhaps the best network available would be one with attributes of both the packet delivery systems and the kind of content-based networks the NDN contemplates.
• The performance of the network would seem to be difficult to understand or plan for. Information you are looking for might be close by or far away; even if it is close by, network devices might be bogged down servicing a lot of other requests, so they cannot service your request immediately. Quality of Service (QoS) would need to be completely rethought, down to the meaning of QoS itself, in this kind of network.
It does not seem as though NDN will become a commonly used technology, but it serves as a useful introduction to a very similar technology poised to have a large impact on the information technology world: blockchains.
To understand blockchains, you must begin with the hash. A hash is a simple concept that is quite difficult to implement in a useful way: a hash function takes a string of numbers of any size and returns a fixed-length number, or hash, (more or less) uniquely representing the original string. The simple-to-implement part is this: one rather naïve hash is to add the digits in a set of numbers until you reach a single digit, calling the result the hash. For instance:
23523
2 + 3 + 5 + 2 + 3 == 15
1 + 5 == 6
Hence, the number 23523 can be represented as 6. One curious property of the hash is that there is no way to determine, from the hash, what the original number was; this is one of the essential observations behind many uses of hashes. If I share a number with some third party, and that party then shares it with you, you can ask me for the hash of the number (without telling me what the actual number is!), and you can verify the number you have is the same by checking that the hash I give you matches the one you calculate.
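The naive digit-sum hash just described can be sketched in a few lines:

```python
def digit_sum_hash(number: int) -> int:
    """The naive hash described above: repeatedly sum the decimal
    digits until a single digit remains."""
    while number >= 10:
        number = sum(int(digit) for digit in str(number))
    return number

print(digit_sum_hash(23523))  # 2+3+5+2+3 == 15, then 1+5 == 6, so 6
```

Note that 222, 33, and 111111 all produce the same result, 6, which is exactly the collision problem discussed next.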
The preceding hash is naïve because it is too easy to obtain a collision. In other words, there are many different sets of numbers that will result in a hash of 6 given the same process, such as 222, 33, 111111, and (probably) an almost infinite number of others. The tricky part of building a hash, then, is in ensuring collisions are rare or nonexistent.
Assuming you have developed such a hash (there are a number of them), you can then use hashes to build a Merkle Tree, as illustrated in Figure 30-4.
In Figure 30-4, four numbers have been processed through an algorithm to produce a hash: H1 through H4. H1 and H2 are, in turn, hashed to produce H5, and H3 and H4 are hashed to produce H6. H5 and H6 are, in turn, hashed to produce the root hash. There are a number of interesting things about Merkle trees; for instance, if you change the value of H1 for any reason, the value of the root hash also changes. Of course, this “just makes sense,” but it means you can validate the contents of any group of files or values by examining a single value. Further, you can verify which part of the tree the change has taken place on if you have access to all the hashes in the tree even though you do not know, or do not know if you can trust, the values themselves.
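The structure in Figure 30-4 can be sketched with a real hash function (SHA-256 here; the leaf values are invented for illustration):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Four leaf hashes, H1 through H4 as in Figure 30-4.
h1, h2, h3, h4 = (h(v) for v in (b"v1", b"v2", b"v3", b"v4"))
h5 = h(h1 + h2)
h6 = h(h3 + h4)
root = h(h5 + h6)

# Changing any leaf value changes the root hash...
t1 = h(b"v1-tampered")
t5 = h(t1 + h2)
new_root = h(t5 + h6)
assert new_root != root
# ...while the untouched subtree hash (H6) is unchanged, which
# localizes the change to the H5 side of the tree.
```

This is the property described above: one value (the root) validates the whole group, and comparing intermediate hashes tells you which branch changed.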
To get to a blockchain, you string the Merkle tree out, as shown in Figure 30-5. Here the hashes of H1 and H5 are hashed to form H2, the hashes of H2 and H6 are hashed to form H3, etc. What is interesting about a blockchain is that you can tell if any step has been repeated twice, if work has not been done, or if any of the numbers in the previous part of the chain have been changed—hence its usefulness in forming digital currencies.
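A minimal sketch of the chaining idea follows; blocks of transactions, proof of work, and consensus are all omitted here:

```python
import hashlib

def chain_hashes(blocks):
    """Each hash covers the block's data plus the previous hash, so a
    change to any earlier block changes every hash that follows it."""
    hashes, previous = [], b""
    for data in blocks:
        previous = hashlib.sha256(previous + data).digest()
        hashes.append(previous)
    return hashes

original = chain_hashes([b"tx1", b"tx2", b"tx3"])
altered = chain_hashes([b"tx1-changed", b"tx2", b"tx3"])
assert original[0] != altered[0]  # the modified block's hash differs...
assert original[2] != altered[2]  # ...and so does every hash after it
```

This is why tampering anywhere in the chain is detectable from the most recent hash alone.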
There is, in fact, more to a real blockchain than this; there is also a consensus process. This description is a radical simplification.
Once you have a blockchain, what can you do with it? Remember the concept of the NDN described earlier? Now consider: what if every block on this blockchain were an object, as described in the NDN network? It should be possible to traverse the tree, using the information from the hash itself, to find the object you are looking for. Even if there is a newer version of the object, the older version should still exist, in its encrypted form, allowing you to compare every version of the object all the way back to its creation. There is no way to change any of the objects contained in the blockchain without invalidating the encryption on every object after the one modified.
Cryptocurrencies take advantage of these properties to allow users to place transactions on the blockchain across time. No transaction can be undone without invalidating the entire blockchain; there are many copies of the blockchain in existence, so a single copy being invalidated should cause the entire network of devices participating in the blockchain to quickly discard the invalidated copy.
Other blockchain systems, such as Ethereum, go beyond the idea of a cryptocurrency by allowing executable code to be stored in the blockchain alongside transactions. This means a virtual machine can be given an Ethereum blockchain that contains not only data to operate on (such as move some amount of money from one account to another account), but also some instructions about under what conditions the data should be acted on (when the receiver signs for the package). The operation could take place in full public view, but without information about the people involved, account numbers, etc., being exposed to the public view (because these can all be represented by hashes, instead of the real numbers, that can only be interpreted by the parties involved in the transaction).
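The idea of pairing data with the conditions governing it can be illustrated with a toy block; this is a sketch of the concept only, bears no resemblance to how Ethereum actually executes contract code, and every name in it is invented.

```python
# A block carrying both a transaction and a condition under which the
# transaction may be acted on; identifiers are stand-ins for hashes.
block = {
    "transaction": {"amount": 100, "to": "hash-of-recipient-account"},
    "condition": lambda state: state.get("package_signed", False),
}

state = {"package_signed": False}
assert not block["condition"](state)  # the transfer may not yet proceed
state["package_signed"] = True
assert block["condition"](state)      # the receiver signed; proceed
```

The point is simply that the executable rule travels with the data, so any node holding the chain can evaluate when the transaction becomes valid.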
A blockchain system like Ethereum could, in theory, provide an overlay on the entire public Internet, providing the same sort of system as the creators of the NDN originally conceived.
The Internet is, to most engineers, a constant. The protocols remain the same, and while the providers shift roles from time to time, or one provider buys another, there is very little apparent change in the Internet as a whole. This, however, is not a realistic view of the world. Figure 30-6 illustrates how the Internet has been built since the first few years of its commercialization.
This shape clearly puts the large-scale transit providers in a central role. The QoS and security protections offered by the transit providers regulate how quickly any user can send or receive traffic. If you want to start a new content or edge provider network or service, you can connect to the transit providers and reach pretty much everyone who connects to the Internet. What has been happening in the five years or so before this writing is a shift in the way content and edge providers are connected. The new connection pattern is illustrated in Figure 30-7.
内容提供商发现了一个简单的事实:内容加载的速度推动了用户参与度,而用户参与度又推动了收入。为了使页面加载速度更快,内容提供商需要“更接近”用户。更紧密本质上意味着尽可能切断交通提供商并直接连接到边缘提供商。这意味着全球互联网正在慢慢从对等网络的网格转向中心辐射模式,大型内容提供商位于中心,边缘提供商充当辐射。
The content providers have discovered a simple fact: the speed at which their content loads drives user engagement, and user engagement drives revenue. To make their pages load faster, the content providers need to be “closer” to their users. Being closer essentially means cutting out transit providers wherever possible and connecting directly to the edge providers. This means the global Internet is slowly moving away from being a mesh of peer networks to a more hub-and-spoke pattern, with large content providers in the hub, and edge providers acting as the spokes.
There is little sense of what this means in the long term. For instance, it could mean:
• The Internet will eventually fragment, with the content you can reach being determined by the edge provider you connect to (because not every edge provider will connect to every content provider).
• The transit providers could shrink, but not ultimately die off, allowing full connectivity, but with two classes of service; large content providers will be quickly reachable, while smaller and newer ones will be forced to take the slow path.
The second already appears to be happening. The ultimate effect of this “slow path/fast path” arrangement is that it becomes ever more difficult to start a new content service on the global Internet, which concentrates ever more power in a smaller group of players over time. Whether this trend will continue, and whether its ultimate end is healthy for the Internet as a whole, or for the network engineering and larger information technology ecosystems that rely on the Internet, is hard to say at this point.
But this is certainly one of those trends worth factoring into any view of what the future of network engineering might look like.
It often seems, in the present moment, like the world is changing too fast, there is no way to keep up, and the future of network engineering is bleak. There are some parts of the network engineering world for which this is likely true; old technologies do, ultimately, die, and others come to the front to take their place (or maybe the entire problem that the technology was designed to solve no longer exists for some reason). Through all of this, however, there will always be a need for well-trained, thoughtful engineers who understand the basic problems, and the scope of solutions available for those problems. For engineers who understand the technology at a more basic level, and hence can ask the right questions at the right time to make a difference in the way a business runs, there will always be a bright future in network engineering.
If you have read this far, studied the examples, and spent time thinking through the technologies as they have been presented here, you are at least starting on the road toward developing the skills needed to be one of those engineers who will always be in demand.
Bjorklund, Martin. The YANG 1.1 Data Modeling Language. Request for Comments 7950. RFC Editor, 2016. https://rfc-editor.org/rfc/rfc7950.txt.
Callon, Ross. The Twelve Networking Truths. Request for Comments 1925. RFC Editor, 1996. doi:10.17487/RFC1925.
“Ethereum Homestead Documentation.” Accessed August 30, 2017. http://www.ethdocs.org/en/latest/.
Gates, Mark. Blockchain: Ultimate Guide to Understanding Blockchain, Bitcoin, Cryptocurrencies, Smart Contracts and the Future of Money. CreateSpace Independent Publishing Platform, 2017.
Huston, Geoff. “The Death of Transit?” APNIC Blog, October 28, 2016. https://blog.apnic.net/2016/10/28/the-death-of-transit/.
“HyperConverged.org.” Accessed August 30, 2017. http://www.hyperconverged.org/.
Matsumoto, Craig. “Why Machine Learning Is Hard to Apply to Networking.” Blog. SDxCentral, January 2, 2017. https://www.sdxcentral.com/articles/news/machine-learning-hard-apply-networking/2017/01/.
Theobald, Oliver. Machine Learning for Absolute Beginners: A Plain English Introduction. Independently published.
“What Is Ether.” Accessed August 30, 2017. https://www.ethereum.org/ether.
White, Russ. “Death of Transit: A Need to Prevent Fragmentation.” Accessed August 30, 2017. http://www.circleid.com/posts/20161107_death_of_transit_need_to_prevent_fragmentation/.
Zhang, Lixia, Deborah Estrin, Jeffrey Burke, Van Jacobson, James D. Thornton, Diana K. Smetters, Beichuan Zhang, et al. “Named Data Networking (NDN) Project,” October 31, 2010. http://named-data.net/techreport/TR001ndn-proj.pdf.
1. What kinds of network resources might be pooled like compute resources in a hyperconverged solution?
2. What is the difference between OpenConfig YANG models and the YANG models standardized by the IETF?
3. Review some of the challenges to implementing and deploying intent-based networking.
4. Where might machine learning be useful in network engineering?
5. What argument does the text use to explain why machine learning may never be used to configure networks?
6. What is the advantage of Named Data Networking over packet-based networks?
7. Research Ethereum. How might blockchains with embedded actions require routing?
8. Some engineers argue it is better to have a common modeling language, rather than a common set of models, for automation. What do you think their line of argument might be?
1. Callon, The Twelve Networking Truths, 1.
2. Matsumoto, “Why Machine Learning Is Hard to Apply to Networking.”
8B10B encoding scheme (Gigabit Ethernet), 99–100
AAA systems, 567
ABR (Available Bit Rates), 70
access control, 567–568, 739–740
accuracy in network modeling (network troubleshooting), 637–638
Active Networking, 25
advertisements
AF (Assured Forwarding), QoS, 202
aggregating IPv6 addresses, 124–127
algorithms
Bellman-Ford loop-free path calculation as, 324–325
history of, 342
disjoint path algorithms, 356–357
Suurballe’s disjoint path algorithms, 358–363
two-connected networks, 357–358
DUAL, 330
development of, 331
greedy algorithms, 317
multiple metric problem, 356
Suurballe’s disjoint path algorithms, 358–363
alternate loop-free paths, 317–319
P/Q Space model, 321–322, 352–353
waterfall (continental divide) model, 320–321
amplification attacks, 574
ANI (Artificial Network Intelligence), 769–771
API (Application Programming Interfaces)
automation, 763
cloud computing, 725
application layer
four-layer DoD model, 78
OSI model, 83
applications
disaggregation of, 658–659, 662
impact of network failures, 616, 617
data flow control, 616
dropped packets, 617
duplicate packets, 617
end-to-end delays, 616
jitter, 616
out-of-order packets, 617
optimizing via flow pinning, 468–473, 478
ARP (Address Resolution Protocol), interlayer discovery, 159–161
ARPANET, packet switched networks, 13
ASIC (Application-Specific Integrated Circuits)
assets (security), defining, 565
asymmetric cryptography, 260
Asynchronous mode (BFD), 381
Asynchronous mode with echo (BFD), 381
ATM (Asynchronous Transfer Mode), 17
negotiated bit rates, flow control, 69
atomic aggregation (BGP), 542–543
attack surfaces, defining, 565
attackers (threat actors), defining, 564
attacks (threats)
amplification attacks, 574
brute-force attacks, 258
burner attacks, 574
DDoS scrubbers/services, 581–582
preventing, blocking half-open/malformed sessions, 575
preventing, dispersing traffic over multiple servers, 576–577
preventing, filtering unroutable addresses, 578–579
preventing, host operating system modifications, 575
preventing, rate limiting, 575–576
defining, 564
man-in-the-middle attacks, 268–269
reflection attacks, 574
audio streaming and BLE, 752
API, 763
automation engineers, 681
on-box automation, 694
CLI, 681–682, 684, 763
complexity and, 680
controller-based automation, 695–696
data analytics, 697
deployment automation, 696–697
Expect scripting, 682
infrastructure automation tools, 694–695
machine learning, 697
NETCONF, 685
configuring, 686
data stores, 685
management stations, 686
operations, 687
YANG data modeling language and, 687–689
pervasive network automation, 763, 765
puppet components/manifests, automation, 695
regular expressions, 681
RESTCONF, 689
availability, 619
bandwidth
computing bandwidth, history of, 21
flow control
circuit switched networks, 15–16
packet switched networks, 15–16
goodput versus throughput, 110
Banyan Vines, 16
Baran and packet switched networks, Paul, 13
BCP (Best Current Practices), open networking security, 749–750
beamforming, Wireless 802.11 multiplexing, 104–105, 106–107
Bellman-Ford loop-free path calculation, 324
cycles across sample networks, 330
edges, 326
negative cost edges, 330
topologies, 326
BFD (Bidirectional Forwarding Detection)
Asynchronous mode, 381
Asynchronous mode with echo, 381
Demand mode, 382
Demand mode with echo, 382
BGP (Border Gateway Protocol), 458
control plane policies, 474–476
public clouds, 736
RINA model, 84
route reflectors, 457
biometrics, security issues, 562–564
bit rates
ABR, 70
CBR, 69
negotiated bit rates, flow control, 69–70
VBR, 69
BLE (Bluetooth Low Energy)
audio streaming, 752
connection intervals, 752
slave latency, 752
blockchains
Ethereum blockchain system, 777–778
blocking DDoS attacks upstream, 579–580
Bluetooth, BLE
audio streaming, 752
connection intervals, 752
slave latency, 752
botnets, DDoS reflection attacks, 572–574
BPDU, STP and neighbor discovery, 406–407
brute-force attacks, 258
buffering packets
TCP, 217
UDP, 217
burner attacks, 574
Byrd, Col. John, OODA loops, 582–583
caching, control plane information, 520–525
CAPEX (Capital Expenses), public clouds, 726–727
carrier loss, event-driven failure detection, 379–380
CBR (Constant Bit Rates), 69
CBWFQ (Class-Based Weighted Fair Queuing), QoS and congestion management, 212–214
cells, fixed cell sizes (ATM), 18–20
central source of trust, 262
centralized control planes, 25, 398, 482, 483, 503
augmented model, 483
distributed model, 483
hybrid model, 484
microloops, 390
parts of/division of labor, 484–485
replace model, 484
Cerf and TCP, Vint, 16
change distribution, 383, 394–395
centralized control planes, 389–390
flooded distributions
flooding between network devices, 383
hop-by-hop distributions, 387–389
channel sharing (multiplexing), 107–108
chipsets (Ethernet), 95–96, 98
cipher blocks, transport security
cipher blocks as substitution tables, 253–255
substitution tables generated by large key transforms, 255–258
circuit lossiness, public clouds, 736
circuit switched networks, 12–13
advantages of, 11
disadvantages of, 11
packet switched networks versus, 13–15
data (forwarding) planes, 12
management planes, 12
CLI (Command-Line Interface)
automation, 681–682, 684, 763
cloud computing, 725
CLNS (Connectionless Mode Network Service), 16
clocking packets from memory, packet switching, 190–191
API, 725
CLI, 725
defining, 723
FaaS, 724
hybrid clouds, 725
IaaS, 724
PaaS, 724
private clouds, 725
BGP, 736
business agility, 727
circuit lossiness, 736
cloud exchange services, 734
costs of, 730
data gravity, 735
data protection over public clouds, 737–738
encryption, 738
HTTPS, 738
infrastructure design, 729
infrastructure failures, 729
IPSec, 738
jitter, 736
managing secure connections, 738–739
monitoring cloud networks, 740
multiple Internet connection, 735–737
multitenant clouds, 739
nontechnical tradeoffs, 728
operational tradeoffs, 728–731
time-to-market, 727
SaaS, 724
security, 737
data protection over public clouds, 737–738
encryption, 738
HTTPS, 738
IPSec, 738
managing secure connections, 738–739
monitoring cloud networks, 740
troubleshooting, infrastructure failures, 729
CoDel (Controlled Delays), buffering packets, 217–218
cold potato routing, control plane policies, 464–466
communications systems, digital grammars
dictionaries, protocols as, 40–47
error management, 38, 39, 47–55
feedback loops, 64
protocols
defining, 40
dictionaries, 42
flexibility, 40
metadata tradeoffs, 40
optimizing, 40
resource efficiency, 40
shared object dictionaries, 46–47
complexity (network), 25–26, 599
automation and, 680
control plane policies, 474–476
event-driven failure detection, 380
network stretch
control plane state versus, 28–29
defining, 28
reasons for, 26
tradeoffs, 33
components (networks), defining, 631–632
composable systems, network design, 657–658
compression (storage), 657
computing bandwidth, history of, 21
computing memory, history of, 21
computing power, history of, 21
congestion
network path choke points, 196–197
QoS and congestion management, 207
elephant flows, 214
overcongestion, 214
policing, 215
traffic shaping, 214
connection intervals, data exchanges, 752
connection-oriented protocols, 86
connectionless protocols, 86
contention, crossbar fabrics, 188–189
continental divide (waterfall) model, alternate loop-free paths, 320–321
control planes
centralized control planes, 25, 398, 482, 483, 503
augmented model, 483
distributed model, 483
hybrid model, 484
microloops, 390
parts of/division of labor, 484–485
replace model, 484
convergence process, 374
distributed control planes, 14–15, 398–399
distance vector protocols, 399
link state protocols, 399
path vector protocols, 399
false positives, 375
information hiding, 526
aggregating reachability information, 515–518
BGP atomic aggregation, 542–543
BGP reachability overlay, 544–546
caching control plane information, 520–525
control plane state scope, 508–510
filtering reachability information, 518–519
positive feedback loops, 510–513
slowing down state velocity, 525–526, 548–554
solution space, 513
SR with controller overlay, 546–548
summarizing topology information, 514–515, 530
summarizing topology information, IS-IS, 530–535
summarizing topology information, OSPF, 535–542
information overload, 375
loop-free paths
MST, 317
multiple overlay control planes, interaction surfaces, 476–478
network diagrams, 283
advertising reachability/topologies, 295–298
proactive distribution of reachability, 300–302
redistribution of reachability/topologies, 303–307
three-way handshakes, 291
hot potato routing, 464–466
multiple overlay control planes, interaction surfaces, 476–478
resource segmentation, 466–468, 476
traffic engineering in data center fabrics, 470–473
traffic flow optimization, 473–474
positive feedback loops, 375
SPT, 317
controller-based automation, 695–696
converged networks, 655
convergence process (control planes), 374
corporate networks and VPN, 227–229
correcting errors (error management), 53–54
costs, network design, 592–593
CRC (Cyclical Redundancy Checks), 49–53, 55
crossbar fabrics
cryptography
asymmetric cryptography, 260
cryptographic functions, 255, 258, 259
key exchanges, 261
central source of trust, 262
PKI, 262
private key cryptography, 262–263
transitive trust, 262
web of trust, 262
private key cryptography
public key cryptography versus, 260–261
public key cryptography
private key cryptography versus, 260–261
symmetric cryptography, 260
CSMA/CD (Carrier Sense Multiple Access/Collision Detection), Ethernet, 93, 98
CUBIC, QUIC retransmissions, 138
CWND (Congestion Window), TCP windowed flow control, 133–134
DAD (Duplicate Address Detection)
false positives, resolving, 163
IPv4 addressing, 160
IPv6 addressing, 162
data (forwarding) planes (TDM systems), 12
data analytics, 697
data center fabrics, traffic engineering, 470–473
data center firewall clusters, virtual networks, 702
data deduplication and storage, 657
data exchanges, connection intervals, 752
data exhaust, 251–252, 264–265, 571
data flow control, application impacts of network failures, 616
data gravity, public clouds, 735
data link layer (OSI model), 82
data mining, 769
data modeling languages
OpenConfig, 764
data packets, application impacts of network failures
dropped packets, 617
duplicate packets, 617
out-of-order packets, 617
data validation, transport security, 250
databases, mapping, interlayer discovery, 152–153
Davies and packet switched networks, Donald, 13
DDoS (Distributed Denial of Service) attacks, 750
botnets and DDoS reflection attacks, 572–574
DDoS scrubbers/services, 581–582
preventing
blocking half-open/malformed sessions, 575
dispersing traffic over multiple servers, 576–577
filtering unroutable addresses, 578–579
host operating system modifications, 575
deduplication (data) and storage, 657
default gateways
delays, application disaggregation, 662
Demand mode (BFD), 382
Demand mode with echo (BFD), 382
deployment automation, 696–697
detecting errors (error management), 48–53, 55
DevOps, 695
DFS (Depth First Search) and MRT (Maximally Redundant Trees), 363–366
DHCP (Dynamic Host Configuration Protocol)
DHCPv6
stateless DHCPv6, 158
stateful DHCP, 158
diagrams (network), 281–282, 284, 307–308
control planes, 283
advertising reachability/topologies, 295–298
proactive distribution of reachability, 300–302
redistribution of reachability/topologies, 303–307
three-way handshakes, 291
edges, 285
defining, 284
leaf nodes, 284
transit nodes, 284
reachable destinations, 286–287, 293
advertising reachability, 295–298
proactive distribution of reachability, 300–302
reactive distribution of reachability, 298–300
redistribution between control planes, 303–307
dictionaries
shared object dictionaries, 46–47
Unicode dictionaries, 42
digital grammars
dictionaries, protocols as, 40–47
error management, 38, 39, 47–48, 48–53
feedback loops, 64
protocols
defining, 40
dictionaries, 42
flexibility, 40
metadata tradeoffs, 40
optimizing, 40
resource efficiency, 40
shared object dictionaries, 46–47
Dijkstra’s SPF (Shortest Path First), 341–349
flooded distributions, IS-IS, 436
history of, 342
disaggregated networks, 654–656, 677
application disaggregation, 658–659, 662, 672–676
east/west traffic flows, 659–661
packet switched fabrics, 662–666
routers, 673
disjoint path algorithms, 356–357
Suurballe’s disjoint path algorithms, 358–363
two-connected networks, 357–358
dispersing DDoS attacks, 576–577
distance vector protocols, 399
loop-free paths, 22
flush timers, 415
hold-down timers, 415
triggered updates, 415
routing tables, 424
packet forwarding, 402
packet switching, 402
reachable destinations, 407–408
distributed control planes, 14–15, 398–399
distance vector protocols, 399
link state protocols, 399
path vector protocols, 399
distributed databases, CAP theorem, 392–394
DNS (Domain Name Systems), interlayer discovery, 154–156
DoD (Department of Defense) model, 76–77
four-layer DoD model, 77–78
application layer, 78
Internet layer, 77
physical layer, 77
transport layer, 77
dropped packets, application impacts of network failures, 617
DSCP (Differentiated Services Code Point), QoS
DSCP mutation, 204
Ethernet DSCP and IPv4 ToS fields, 200–202
DUAL (Diffusing Update Algorithm), 330
development of, 331
duplicate packets, application impacts of network failures, 617
Dyn, IoT and DDoS attacks, 744–745, 750
east/west traffic flows, network design, 659–661
ECMP (Equal Cost Multipath), 178
link aggregation, 178
LACP, 181
routed parallel links, 182–183
edges, 285
Bellman-Ford loop-free path calculation, 326
negative cost edges, Bellman-Ford loop-free path calculation, 330
EEM (Embedded Event Manager), on-box automation, 694
EF (Expedited Forwarding), QoS, 202, 203
EGP (Exterior Gateway Protocol), BGP as, 451–452
EIGRP (Enhanced Interior Gateway Protocol), 416, 424
elephant flows, 180
control plane policies, 468–473
QoS and congestion management, 214
encryption
end-to-end encryption, 101
hop-by-hop encryption, 101
MAC address randomization, 265–266
private key cryptography, 570–571
processors, 657
public clouds, 738
public key cryptography, 570–571
storage and, 657
transport security
cipher blocks as substitution tables, 253–255
cryptographic functions, 255, 258, 259
multiple rounds of encryption, 259–260
substitution tables generated by large key transforms, 255–258
endpoint isolation and IoT security, 747–748
end-to-end delays, application impacts of network failures, 616
end-to-end encryption, 101
error management, 38, 39, 47–48
Wireless 802.11, 109
Ethereum blockchain system, 777–778
DSCP and IPv4 ToS fields, QoS, 200–202
flow control, 101
marshaling, 100
OSI model, 83
RINA model, 85
switched Ethernet network operation, 98–99
virtual networks, Ethernet services over IP networks, 226–227
event-driven failure detection, 377–378
complexity (network), 380
polling-based failure detection versus, 378–379
examination, protecting data from (transport security), 250–251
exhaust (data), 251–252, 264–265, 571
Expect scripting and automation, 682
exploits, defining, 564
FaaS (Functions as a Service), 724
failure detection
event-driven failure detection, 377–378
complexity, 380
polling-based failure detection versus, 378–379
polling-based failure detection, 376–377, 378–379
false positives and control planes, 375
fast switching paths. See interrupt context switching
fate sharing, 139
feature creep in cloud computing, 730–731
features versus usage (network engineering), 8–9
FEC (Forward Error Correction), 53–54
feedback loops, 64
FIB (Forwarding Information Base), 14, 386–387
filtering unroutable addresses, preventing DDoS attacks, 578–579
fingerprints as passwords, 562–563
fixed length fields (protocols), 43–44
fixed window flow control, 67–69
flexibility
network design
flooded distributions
flooding between network devices, 383
IS-IS, 436
application impacts of network failures, 616
circuit switched networks, 15–16
Ethernet, 101
feedback loops, 64
packet switched networks, 15–16
TCP
retransmitting, 132
RWND, 133
SACK, 132
windowed flow control with serial numbers, 130–131
windowing, 65
fixed window flow control, 67–69
single packet windows (ping pong), 65–68
Wireless 802.11, 109
elephant flows, 180
mouse flows, 180
flow pinning
application optimization, 468–473, 478
control plane policies, 468–473
flush timers, 415
forklift and network design flexibility, 593–595
forwarding packets, virtual networks, 223–225
forwarding planes. See data (forwarding) planes (TDM systems)
four-layer DoD model, 77–78
application layer, 78
Internet layer, 77
physical layer, 77
transport layer, 77
fragmentation
Garcia-Luna-Aceves and DUAL, J.J., 331
gateways (default)
Gigabit Ethernet, 8B10B encoding scheme, 99–100
goodput versus throughput, 110
GR (Graceful Restart), 622–623
greedy algorithms, 317
gRPC, 47
half-open/malformed sessions, blocking (security), 575
half-split method (network troubleshooting), 641–643, 645
handshakes
three-way handshakes, control planes, 291
two-way handshakes
hardware offload, VNF, 717
hash buckets, 179
hashes (cryptographic), 263–264
headers, NSH, service chaining, 708–709
hidden nodes, wireless networks, 108–109
hiding information, 526
aggregating reachability information, 515–518
BGP
caching, control plane information, 520–525
control plane state scope, 508–510
filtering reachability information, 518–519
positive feedback loops, 510–513
slowing down state velocity, 525–526, 548
link state flooding reduction, 552–554
solution space, 513
SR with controller overlay, 546–548
summarizing topology information, 514–515, 530
hierarchical network design, 600
recursive hierarchical network design, 602–603
three-tier hierarchical network design, 600–601
two-tier hierarchical network design, 601
higher level transport protocols, 116
IP, 116
head of line blocking, 138–139
retransmissions, 138
IP development, 117
port numbers, 135
HOL (Head-of-Line) blocking, packet switching, 188
hold-down timers, 415
Honeywell Labs and IS-IS, 430
hop counts, 122
hop limits, 122
hop-by-hop distributions, 387–389
hop-by-hop encryption, 101
host operating systems, security, 575
hot potato routing, control plane policies, 464–466
how models (network troubleshooting), 633–634
HTTPS (Hypertext Transfer Protocol Secure), public clouds, data protection over public clouds, 738
hub-and-spoke topologies, 239–240, 609–610
hybrid clouds, 725
hyperconverged networks, 656, 765–767
I2RS (Interface to the Routing Systems), 490–495
IaaS (Infrastructure as a Service), 724
ICMP (Internet Control Message Protocol), 117, 142–143
identifier mapping, interlayer discovery, 150–151, 153–154
incremental SPF (Shortest Path First), 349–350
information hiding, 526
aggregating reachability information, 515–518
BGP
caching, control plane information, 520–525
control plane state scope, 508–510
filtering reachability information, 518–519
第598章
modularity and, 598
positive feedback loops, 510–513
slowing down state velocity, 525–526, 548
link state flooding reduction, 552–554
解空间,513
solution space, 513
SR with controller overlay, 546–548
summarizing topology information, 514–515, 530
第375章
information overload, control planes and, 375
infrastructure automation tools, 694–695
输入排队交换机、数据包交换、188
input-queued switches, packet switching, 188
基于意图的网络, 714 – 715 , 767 – 769
intent-based networking, 714–715, 767–769
inter-area router LSA, 540–541
multiple overlay control planes, 476–478
网络功能虚拟化,718
NFV, 718
虚拟网络
virtual networks
overlaid control panels, 243–245
shared risk link groups, 242–243
层间发现,151
interlayer discovery, 151
标识符计算,154
identifier calculations, 154
identifier mapping, 150–151, 153–154
IPv4
IPv4
ARP, interlayer discovery, 159–161
IPv6, default gateways, 166–167
manually configured identifiers, 151–152
well known identifiers, 151–152
Internet, reshaping of, 778–780
互联网层(四层 DoD 模型),77
Internet layer (four-layer DoD model), 77
interrupt context switching, 183–186
IoT (Internet of Things), 743, 757
连接性, 745 , 751 , 751 – 752 , 753
connectivity, 745, 751, 751–752, 753
LoRaWAN 、753 – 754、755 – 756 _
数据处理,745
data processing, 745
移动物联网,755
mobile IoT, 755
可扩展性,754
scalability, 754
安全,745
security, 745
isolation-based security, 746–748
IP(互联网协议),116
IP (Internet Protocol), 116
IPv4,118 _
IPv4, 118
地址空间使用情况,118
address space usage, 118
ARP, interlayer discovery, 159–161
爸爸,160
DAD, 160
ToS fields and Ethernet DSCP, 200–202
IPv6,118 _
IPv6, 118
爸爸,162
DAD, 162
DHCPv6, interlayer discovery, 156–159
SLAAC,162
SLAAC, 162
OSI 模型,83
OSI model, 83
欺骗地址,137
spoofing addresses, 137
virtual networks, Ethernet services over IP networks, 226–227
IPSec(IP安全)
IPSec (IP Security)
公共云、公共云上的数据保护、738
public clouds, data protection over public clouds, 738
RINA model, 85
IPX (Internet Packet Exchange), 16
IS-IS (Intermediate System to Intermediate System), 431, 439
flooded distributions, 436
history of, 430
link state protocols, 449
links, 449
multiaccess links/networks, 446–449
nodes, 449
summarizing topology information, 530–535
iSLIP algorithm, packet switching, 189–190
isolation-based security and IoT, 746
service-based isolation, 746–747
ISSU (In-Service Software Upgrades), 623
ITU (International Telecommunications Union), 16
jitter
application disaggregation, 662
application impacts of network failures, 616
BFD and, 382
public clouds, 736
Kahn and TCP, Bob, 16
Kerckhoff’s principle, 257
key exchanges (cryptography), 261
central source of trust, 262
PKI, 262
private key cryptography, 262–263
transitive trust, 262
web of trust, 262
Krebs and IoT DDoS attacks, Brian, 744
KrebsOnSecurity.com, IoT and DDoS attacks, 744–745, 746, 750
label switching (ATM), 17–18, 20
LACP (Link Aggregation Control Protocol), 181
latency
slave latency, 752
Lawrence Livermore Laboratory, packet switched networks, 13
layering, information hiding, 543–544
leaf nodes, 284
LFA. See alternative loop-free paths
link aggregation
LACP, 181
out-of-order packets, 178
link failures, RIP and, 414–415
link state detection, BFD, 380–382
link state flooding reduction, 552–554
link state protocols, 399
flooded distributions, 436–439
history of, 430
multiaccess links/networks, 446–449
nodes, 449
summarizing topology information, 530–535
loop-free paths, 22
flooded distributions, 443–445
history of, 430
nodes, 449
summarizing topology information, 535–542
links
IS-IS, 449
OSPF, 449
LLQ (Low-Latency Queuing), QoS and congestion management, 208–212, 214, 217
load-balancers, virtual networks, 702
loop-free paths, 20
alternate loop-free paths, 317–319
P/Q Space model, 321–322, 352–353
waterfall (continental divide) model, 320–321
Bellman-Ford loop-free path calculation, 324
cycles across sample networks, 330
edges, 326
negative cost edges, 330
topologies, 326
history of, 342
disjoint path algorithms, 356–357
Suurballe’s disjoint path algorithms, 358–363
two-connected networks, 357–358
Distance Vector protocols, 22
DUAL, 330
development of, 330
Link State protocols, 22
Path Vector protocols, 22
protocol wars, the, 22
loops
feedback loops, 64
hop counts, 122
hop limits, 122
microloops, 395
centralized control planes, 390
flooded distributions, 384–385
hop-by-hop distributions, 388
actions, 585
observation, 583
positive feedback loops, 375
routing loops, redistribution of reachability/topologies, 306–307
LoRaWAN and IoT, 753–754, 755–756
lossiness (circuit), public clouds, 736
Lougheed and BGP, Kirk, 451
lower layer transport protocols, 110–111
flow control, 101
marshaling, 100
switched Ethernet network operation, 98–99
Wireless 802.11, 102
error management, 109
flow control, 109
marshaling, 109
LSA (Link State Advertisements), inter-area router LSA, 540–541
MAC (Media Access Control) addresses
MAC-48/EUI-48 address format, 97–98
randomization (encryption), 265–266
malformed/half-open sessions, blocking (security), 575
man-in-the-middle (MITM) attacks, 268–269, 568–569
management planes (TDM systems), 12
management stations (NETCONF), 686
mapping
databases, interlayer discovery, 152–153
port mapping, interlayer discovery, 151–152
dictionaries, protocols as, 40–47
Ethernet, 100
protocols
defining, 40
dictionaries, 42
flexibility, 40
metadata tradeoffs, 40
optimizing, 40
resource efficiency, 40
shared object dictionaries, 46–47
Wireless 802.11, 109
mashed potato routing, control plane policies, 464–466
memory
computing memory, history of, 21
packet switching
clocking packets from memory, 190–191
clocking packets to memory, 173–174
metrics
multiple metric problem, 356
redistribution of reachability/topologies, 306–307
MIB tables and automation, 682–683
microloops, 395
centralized control planes, 390
flooded distributions, 384–385
hop-by-hop distributions, 388
MITM (Man-In-The-Middle) attacks, 268–269, 568–569
MLAG (Multichassis Link Aggregation), 181–182
mobile IoT (Internet of Things), 755
modeling languages
OpenConfig, 764
modeling networks (troubleshooting)
shifting between models, 639–641
modularity
information hiding, 598
complexity, 599
optimization, 599
scalability, 599
tradeoffs, 598
MPLS (Multiprotocol Label Switching), 20
headers, 233
as tunneling protocol, 236
MRT (Maximally Redundant Trees), 363–366
MST (Minimum Spanning Trees), 315–316, 317
MTBF (Mean Time Between Failures), 617–618
MTBM (Mean Time Between Mistakes), 619
MTTI (Mean Time To Innocence), 619
MTTR (Mean Time To Repair), 618, 624–626
MTU (Maximum Transmission Units)
control planes, 291
PMTUD, 229
multiple overlay control planes, interaction surfaces, 476–478
IPv6, 123
spatial multiplexing, 103–104, 106–107
virtual networks and, 222
virtualization versus, 282–283
Wireless 802.11, 102
multiple paths within a single room, 103–104
spatial multiplexing, 103–104, 106–107
NACK (Negative Acknowledgements), QUIC retransmissions, 138
NAT (Network Address Translation), IoT and, 754–755
NCP (Network Control Protocol), flag days, 41
NDN (Named Data Networking), 772
negative cost edges, Bellman-Ford loop-free path calculation, 330
negotiated bit rates, flow control, 69–70
neighbor discovery, 294
control planes, detecting devices from, 287–290
NETCONF, 685
configuring, 686
data stores, 685
management stations, 686
operations, 687
YANG data modeling language and, 687–689
network diagrams, 281–282, 284, 307–308
control planes, 283
advertising reachability/topologies, 295–298
proactive distribution of reachability, 300–302
redistribution of reachability/topologies, 303–307
three-way handshakes, 291
edges, 285
defining, 284
leaf nodes, 284
transit nodes, 284
reachable destinations, 286–287, 293
advertising reachability, 295–298
proactive distribution of reachability, 300–302
reactive distribution of reachability, 298–300
redistribution between control planes, 303–307
business to technology fit, 7–9
future of, 780
network layer (OSI model), 82–83
networks, 612
ATM, 17
MPLS, 20
API, 763
automation engineers, 681
on-box automation, 694
CLI, 681–682, 684, 763
complexity and, 680
controller-based automation, 695–696
data analytics, 697
deployment automation, 696–697
Expect scripting, 682
infrastructure automation tools, 694–695
machine learning, 697
pervasive network automation, 763, 765
Puppet components/manifests, automation, 695
regular expressions, 681
circuit switched networks, 12–13
advantages of, 11
packet switched networks versus, 13–15
automation and, 680
reasons for, 26
tradeoffs, 33
converged networks, 605–607, 655
data deduplication, 657
disaggregated networks, 654–656, 677
application disaggregation, 658–659, 662, 672–676
east/west traffic flows, 659–661
packet switched fabrics, 662–666
routers, 673
east/west traffic flows, 659–661
failures, application impacts of, 616, 617
data flow control, 616
dropped packets, 617
duplicate packets, 617
end-to-end delays, 616
jitter, 616
out-of-order packets, 617
flexibility
hierarchical design, 600
recursive hierarchical network design, 602–603
three-tier hierarchical network design, 600–601
two-tier hierarchical network design, 601
hub-and-spoke topologies, 609–610
hyperconverged networks, 656, 765–767
complexity, 599
optimization, 599
scalability, 599
tradeoffs, 598
noncontending networks, 668–669
optimization, 599
oversized networks, 597
packet switched fabrics, 662–666
packet switched networks, 22–25
advantages/disadvantages of, 15
circuit switched networks versus, 13–15
development of, 13
distributed control planes, 14–15
FIB, 14
metadata, 13
RIB, 14
paths, congestion choke points, 196–197
problems with, 592
redundancy
ISSU, 623
replacing equipment, 596
resiliency
defining, 617
ISSU, 623
MTBM, 619
MTTI, 619
ring topologies, 605
ring topologies, 603
resiliency, 605
traffic engineering, 604
routers, disaggregated networks, 673
spine and leaf topologies, 667–668, 669
storage
compression, 657
converged networks, 655
data deduplication, 657
disaggregated networks, 656
encryption, 657
hyperconverged networks, 656
stretch
control plane state versus, 28–29
defining, 28
traffic engineering, ring topologies, 604
troubleshooting, 633
accuracy in network models, 637–638
half-split method, 641–643, 645
shifting between models, 639–641
shifting between signals, 642–643
undersized networks, 596
centralized policy management, 713–714
complexity, 241
converged networks, 655
defining, 221
disaggregated networks, 654–656
Ethernet services over IP networks, 226–227
hyperconverged networks, 656
intent-based networking, 714–715, 767–769
interaction surfaces, 242–243, 243–245
multiplexing and, 222
physical network transitions to, 654
processors, 657
SR, 230–232, 232–236, 236–237, 237–238
topologies, 222
NFV (Network Function Virtualization), 703, 708, 719
interaction surfaces, 718
network design flexibility, 703–705
optimization, 718
virtualized services, 717
defining, 284
IS-IS, 449
leaf nodes, 284
OSPF, 449
transit nodes, 284
noncontending networks, 668–669
northbound interfaces, 483
NPU (Network Processing Units), packet switching, 185
NSH (Network Service Headers), service chaining, 708–709
obscurity and security, 258–259, 571–572
Octopus, packet switched networks, 13
OFDM (Orthogonal Frequency Division Multiplexing), Wireless 802.11, 102–103
on-box automation, 694
actions, 585
observation, 583
open networking
DDoS attacks, 750
uRPF, 750
OpenConfig data modeling language, 764
OPEX (Operational Expenses), public clouds, 726–727
opportunity costs, network design, 592–593
optimization
applications via flow pinning, 468–473, 478
network design, 599
NFV, 718
traffic flows, control plane policies, 473–474
ordered FIB (Forwarding Information Base) and microloops, 386–387
OSI (Open Systems Interconnect) model, 80–82, 637–638
application layer, 83
data link layer, 82
Ethernet and, 83
IP and, 83
OSI addressing and IS-IS, 431–433
presentation layer, 83
session layer, 83
TCP and, 83
transport layer, 83
OSPF (Open Shortest Path First), 22, 440, 445–446
flooded distributions, 443–445
history of, 430
inter-area router LSA, 540–541
link state protocols, 449
links, 449
nodes, 449
summarizing topology information, 535–542
totally not-so-stubby areas, 540
totally stubby areas, 539
ossification and network design flexibility, 593–594
out-of-order packets
application impacts of network failures, 617
link aggregation, 178
overcongestion, QoS, 214
overlaid control planes, interaction surfaces and (virtual networks), 243–245
overlay
BGP reachability overlay, 544–546
SR with controller overlay, 546–548
oversized networks, network design, 597
ownership, network design, 594–595
P/Q Space model, alternate loop-free paths, 321–322, 352–353
PaaS (Platform as a Service), 724
packet switched fabrics, 662–666
packet switched networks
advantages/disadvantages of, 15
circuit switched networks versus, 13–15
development of, 13
distributed control planes, 14–15
FIB, 14
metadata, 13
centralized control planes, 25
QoS marking, 23
RIB, 14
packets
advertisement paths and, 401–402
buffers, 173
forwarding
STP, 402
recycling, 232
clocking packets from memory, 190–191
clocking packets to memory, 173–174
ECMP, 178, 178–181, 181, 181–182, 182–183
HOL blocking, 188
input-queued switches, 188
interrupt context switching, 183–186
NPU, 185
packet buffers, 173
processing packets, 174, 183–186
receive rings, 173
ring buffers, 174
STP, 402
partial SPF (Shortest Path First), 349–350
passwords, fingerprints as, 562–563
path vector protocols, 399
BGP, 458
loop-free paths, 22, 454–456, 458
route reflectors, 457
loop-free paths, 22, 454–456, 458
PBR (Policy-Based Routing), service chaining, 708
PCEP (Path Computation Element Protocol), 495–497
Perlman and STP, Radia, 402
pervasive network automation, 763, 765
physical layer
four-layer DoD model, 77
PKI (Public Key Infrastructure), 262
PMTUD (Path MTU Discovery), 229
PN (Programmable Networks), 482
northbound interfaces, 483
southbound interfaces, 483–484
policing, QoS and congestion management, 215
policy management, virtual networks, 713–714
polling-based failure detection, 376–377, 378–379
port mapping, interlayer discovery, 151–152
positive feedback loops, 375, 510–513
Postel and IP development, Jonathan B., 117
power management, history of computing power, 21
presentation layer (OSI model), 83
privacy (user), transport security, 251–252
private clouds, 725
private key cryptography, 570–571
public key cryptography versus, 260–261
processors
encryption, 657
storage, 657
virtual networks, 657
proof demand, QUIC, startup handshakes, 137
protocol stacks, 16
protocols
connectionless protocols, 86
connection-oriented protocols, 86
defining, 40
dictionaries
shared object dictionaries, 46–47
Unicode dictionaries, 42
flexibility, 40
metadata, 40
multiple metric problem, 356
optimizing, 40
resource efficiency, 40
BGP, 736
business agility, 727
circuit lossiness, 736
cloud exchange services, 734
costs of, 730
data gravity, 735
encryption, 738
infrastructure design, 729
infrastructure failures, 729
jitter, 736
multiple Internet connections, 735–737
multitenant clouds, 739
security
data protection over public clouds, 737–738
HTTPS, 738
IPSec, 738
managing secure connections, 738–739
monitoring cloud networks, 740
time-to-market, 727
tradeoffs
nontechnical tradeoffs, 728
operational tradeoffs, 728–731
public Internet and QoS, 206–207
public key cryptography, 570–571
private key cryptography versus, 260–261
public networks and VPN, 227–229
Puppet components/manifests, automation, 695
QoS (Quality of Service)
AF, 202
buffering packets, 215
TCP, 217
UDP, 217
centralized control planes, 25
AF, 202
best practices, 200
DSCP mutation, 204
Ethernet DSCP and IPv4 ToS fields, 200–202
RFC2597, AF, 202
trust boundaries, 201
congestion management, 207
choke points, network paths, 196–197
elephant flows, 214
overcongestion, 214
policing, 215
traffic shaping, 214
packet switched networks, 22–23
QoS marking, 23
queue management
buffering packets, 215
RFC2597, AF, 202
AF, 202
best practices, 200
DSCP mutation, 204
Ethernet DSCP and IPv4 ToS fields, 200–202
RFC2597, AF, 202
trust boundaries, 201
unmarked Internet and, 206–207
QUIC (Quick User Datagram Protocol Internet Connections), 117, 136
head of line blocking, 138–139
retransmissions, 138
RAND Corporation, packet switched networks, 13
RBAC (Role-Based Access Control), public clouds, 739–740
reachable destinations, 286–287, 293
advertising reachability, 295–298
proactive distribution of reachability, 300–302
reactive distribution of reachability, 298–300
redistribution between control planes, 303
reachability
BGP reachability overlay, 544–546
information hiding
aggregating reachability information, 515–518
filtering reachability information, 518–519
receive rings, 173
recursive hierarchical network design, 602–603
recycling packets, 232
RED (Random Early Detection), 215–216
redundancy
ISSU, 623
Reed-Solomon codes, 54
regular expressions, 681
Rekhter, Yakov
ATM, label switching, 20
BGP, 451
remote storage, public clouds, 734–735
replacing equipment, network design, 596
resiliency
availability, 619
defining, 617
MTBM, 619
MTTI, 619
ring topologies, 605
ISSU, 623
resource segmentation, control plane policies, 466–468, 476
RESTCONF, 689
RFC1918, open networking security, 749–750
RFC2474, class selectors, 202–203
RFC2597, AF, 202
RFC2827, open networking security, 749–750
RFC3535 Overview of the 2002 IAB Network Management Workshop, 683–684
RFC3704, open networking security, 750
RIB (Routing Information Base), 14
RINA (Recursive Internet Architecture) model, 84–86, 637–638
ring buffers, 174
ring topologies, 603
resiliency, 605
traffic engineering, 604
RIP (Routing Information Protocol), 410–411, 415–416
flush timers, 415
hold-down timers, 415
triggered updates, 415
risks (security), defining, 565
rLFA (remote Loop-Free Alternate), 322–324, 352–353
Roskind and QUIC, Jim, 136
route reflectors, BGP, 457
routers/routing
cold potato routing, control plane policies, 464–466
disaggregated networks, 673
hot potato routing, control plane policies, 464–466
inter-area router LSA, 540–541
IPv6, router discovery, 161–164
mashed potato routing, control plane policies, 464–466
packet switching, routing, 175–177
PBR, service chaining, 708
routing loops, redistribution of reachability/topologies, 306–307
routing tables, distance vector protocols and, 424
with controller overlay, 546–548
switching versus routing, 177
RTO (Retransmit Time Outs), TCP flow control, 132, 134
RTT (Round Trip Times)
RWND (Receive Window), TCP windowed flow control, 133
SaaS (Software as a Service), 724
SACK (Selective Acknowledgements), TCP flow control, 132
scalability
IoT, 754
scope of control plane state, 508–510
SD-WAN (Software-Defined Wide Area Networks), 239–241
hub-and-spoke topologies, 610
SDN (Software Defined Networks), 15, 482
defined, 482
northbound interfaces, 483
southbound interfaces, 483–484
AAA systems, 567
assets, defining, 565
attack surfaces, defining, 565
attackers (threat actors), defining, 564
attacks (threats)
amplification attacks, 574
brute-force attacks, 258
burner attacks, 574
defining, 564
man-in-the-middle attacks, 268–269
blocking DDoS attacks upstream, 579–580
botnets and DDoS reflection attacks, 572–574
brute-force attacks, 258
cloud computing, 737
data protection over public clouds, 737–738
HTTPS, 738
IPSec, 738
control planes, MITM attacks, 568–569
DDoS reflection attacks, 572–574
DDoS scrubbers/services, 581–582
dispersing DDoS attacks, 576–577
private key cryptography, 570–571
public clouds, 738
public key cryptography, 570–571
exploits, defining, 564
filtering unroutable addresses, 578–579
half-open/malformed sessions, blocking, 575
host operating system modifications, 575
HTTPS, public clouds, 738
IoT, 745
isolation-based security, 746–748
IPSec, public clouds, 738
isolation-based security and IoT, 746
service-based isolation, 746–747
MAC address randomization, 265–266
man-in-the-middle attacks, 268–269
MITM attacks, control planes and, 568–569
obscurity and, 258–259, 571–572
actions, 585
observation, 583
DDoS attacks, 750
uRPF, 750
passwords, fingerprints as, 562–563
problem space, 565
public clouds
encryption, 738
managing secure connections, 738–739
risks, defining, 565
solution space, 565
threat actors (attackers), defining, 564
transport security, 249–250, 272–273
asymmetric cryptography, 260
brute-force attacks, 258
cipher blocks as substitution tables, 253–255
cryptographic functions, 255, 258, 259
data exhaust, 251–252, 264–265
Kerckhoff’s principle, 257
man-in-the-middle attacks, 268–269
multiple rounds of encryption, 259–260
private key cryptography, 260–261, 262–263
protecting data from examination, 250–251
public key cryptography, 260–263
substitution tables generated by large key transforms, 255–258
symmetric cryptography, 260
validating data, 250
vulnerabilities, defining, 564
serverless cloud services, 724
service chaining, 705–707, 709–711
PBR, 708
SFC, 708
service-based isolation and IoT security, 746–747
session layer (OSI model), 83
SFC (Service Function Chaining), 708
Shannon, Claude, 39
shared object dictionaries, 46–47
shared risk link groups, virtual networks, interaction surfaces and shared risk link groups, 242–243
shortest paths
Bellman-Ford loop-free path calculation, 324
cycles across sample networks, 330
edges, 326
negative cost edges, 330
topologies, 326
history of, 342
loop-free paths
alternate loop-free paths, 317–324, 350–352
signal repeaters, event-driven failure detection, 379–380
signal waveforms, Wireless 802.11 multiplexing, 105–106
single packet windows (ping pong), 65–68
SLAAC, IPv6 addressing, 162
slave latency, 752
slowing down state velocity, 525–526, 548
link state flooding reduction, 552–554
SNMP (Simple Network Management Protocol), automation and, 682–683, 684
software
ISSU, 623
SOS (State, Optimization and Surface) model, complexity (network), managing, 32–33
southbound interfaces, 483–484
sparse mode multicasting, 60–61
spatial multiplexing, Wireless 802.11, 103–104, 106–107
spine and leaf topologies, 667–668, 669
SPT (Shortest Path Trees), 315, 316–317
SR (Segment Routing)
with controller overlay, 546–548
SRLG (Shared Risk Link Groups), 621–622
SST (Slow Start Threshold), 133, 134
state velocity, slowing down, 525–526, 548
stateful DHCP (Dynamic Host Configuration Protocol), 158
stateless DHCPv6 (Dynamic Host Configuration Protocol version 6), 158
STK (Source Address Tokens), QUIC and startup handshakes, 137
storage
compression, 657
converged networks, 655
data deduplication, 657
disaggregated networks, 656
encryption, 657
hyperconverged networks, 656
remote storage, public clouds, 734–735
STP (Spanning Tree Protocol), 402
as distance vector protocol, 408–409
reachable destinations, 407–408
stretch (network)
control plane state versus, 28–29
defining, 28
subsidiarity, centralized control planes, 499–503
substitution tables, transport security
cipher blocks as substitution tables, 253–255
substitution tables generated by large key transforms, 255–258
Suurballe’s disjoint path algorithms, 358–363
switching
input-queued switches, 188
interrupt context switching, 183–186
routing versus switching, 177
symmetric cryptography, 260
TCP (Transmission Control Protocol), 117, 128–129
buffering packets, 217
flag days, 41
flow control
retransmitting, 132
RWND, 133
SACK, 132
windowed flow control with serial numbers, 130–131
IP development, 117
OSI model, 83
packet switched networks, flow control, 16
port mapping, interlayer discovery, 152
TCP/IP (Transmission Control Protocol/Internet Protocol), OSI model, 83–84
TDM (Time Division Multiplexing)
circuit switched networks, 9–12
data (forwarding) planes, 12
management planes, 12
technical debt (troubleshooting), 646–647
tests, troubleshooting methods
threat actors (attackers), defining, 564
threats (attacks)
amplification attacks, 574
brute-force attacks, 258
burner attacks, 574
DDoS scrubbers/services, 581–582
preventing, blocking half-open/malformed sessions, 575
preventing, dispersing traffic over multiple servers, 576–577
preventing, filtering unroutable addresses, 578–579
preventing, host operating system modifications, 575
preventing, rate limiting, 575–576
defining, 564
man-in-the-middle attacks, 268–269
reflection attacks, 574
three-tier hierarchical network design, 600–601
three-way handshakes
control planes, 291
throughput versus goodput, 110
TLS (Transport Layer Security), 269–270
components of, 270
secure session startup process (handshakes), 270–272
TLV (Type Length Values), 21, 44–46, 47
Bellman-Ford loop-free path calculation, 326
change distribution, 383, 394–395
centralized control planes, 389–390
flooded distributions, 383–387
hop-by-hop distributions, 387–389
detecting changes, 375
event-driven failure detection, 377–378
complexity, 380
polling-based failure detection versus, 378–379
hub-and-spoke topologies, 609–610
information hiding, summarizing topology information, 514–515, 530
IS-IS, topology discovery, 434–436
link state detection and BFD, 380–382
OSPF, topology discovery, 441–442
polling-based failure detection, 376–377, 378–379
ring topologies, 603
resiliency, 605
traffic engineering, 604
spine and leaf topologies, 667–668, 669
virtual networks, 222
traffic classes, QoS, 199, 203
AF, 202
best practices, 200
DSCP mutation, 204
Ethernet DSCP and IPv4 ToS fields, 200–202
RFC2597, AF, 202
trust boundaries, 201
traffic engineering
ring topologies, 604
spine and leaf topologies, 670–671
traffic flows, optimizing (control plane policies), 473–474
traffic shaping, QoS and congestion management, 214
transit nodes, 284
transitive trust, 262
transport layer
four-layer DoD model, 77
OSI model, 83
transport security, 249–250, 272–273
brute-force attacks, 258
cryptography
asymmetric cryptography, 260
private key cryptography, 260–261, 262–263
public key cryptography, 260–263
symmetric cryptography, 260
data exhaust, 251–252, 264–265
encryption
cipher blocks as substitution tables, 253–255
cryptographic functions, 255, 258, 259
MAC address randomization, 265–266
multiple rounds of encryption, 259–260
substitution tables generated by large key transforms, 255–258
examination, protecting data from, 250–251
Kerckhoff’s principle, 257
man-in-the-middle attacks, 268–269
components of, 270
secure session startup process (handshakes), 270–272
validating data, 250
application layer, 83
data link layer, 82
Ethernet and, 83
IP and, 83
presentation layer, 83
session layer, 83
TCP and, 83
transport layer, 83
trees
triggered updates and RIP, 415
troubleshooting, 629–630, 647–648
云计算,基础设施故障,729
cloud computing, infrastructure failures, 729
half-split method, 641–643, 645
型号, 633
models, 633
shifting between models, 639–641
network components, defining, 631–632
shifting between signals, 642–643
信任边界(QoS),201
trust boundaries (QoS), 201
TTL (Time To Live). See hop counts
tunneling protocols, MPLS as, 236
two-tier hierarchical network design, 601
two-way handshakes
UDP (User Datagram Protocol), buffering packets, 217
undersized networks, network design, 596
Unicode dictionaries, 42
unikernels and IoT security, 748–749
unmarked Internet and QoS, 206–207
updates, triggered updates, RIP and, 415
upgrades
forklift upgrades, network design, 594–595, 596
ISSU, 623
uRPF (Unicast Reverse Path Forwarding)
open networking security, 750
preventing DDoS attacks, 578–579
usage versus features (network engineering), 8–9
user privacy, transport security, 251–252
validating data, transport security, 250
VBR (Variable Bit Rates), 69
VIP (Vines Internet Protocol), 16
virtual networks, 245–246, 701, 702–703
complexity, 241
converged networks, 655
data center firewall clusters, 702
defining, 221
disaggregated networks, 654–656
Ethernet services over IP networks, 226–227
hyperconverged networks, 656
intent-based networking, 767–769
interaction surfaces
overlaid control panels, 243–245
shared risk link groups, 242–243
load-balancers, 702
multiplexing and, 222, 282–283
interaction surfaces, 718
network design flexibility, 703–705
optimization, 718
virtualized services, 717
physical network transitions to, 654
processors, 657
topologies, 222
VNF, 703
benefits of, 715
centralized policy management, 713–714
hardware offload, 717
intent-based networking, 714–715
network design flexibility, 703–705
performance, 716
software optimization, 716–717
tradeoffs, 717
VPN
corporate networks and, 227–229
VNF (Virtualized Network Functions), 703
benefits of, 715
centralized policy management, 713–714
hardware offload, 717
intent-based networking, 714–715
network design flexibility, 703–705
performance, 716
software optimization, 716–717
tradeoffs, 717
VoIP (Voice over Internet Protocol), QoS and congestion management, 208–210, 217
VOQ (Virtual Output Queues), packet switching, 188–189
VPN (Virtual Private Networks)
corporate networks and, 227–229
VRF (Virtual Routing and Forwarding), 222, 225
vulnerabilities, defining, 564
WAN (Wide Area Networks), SD-WAN, 239–241
wasp waist, complexity (network), managing, 32–33
waterfall (continental divide) model, alternate loop-free paths, 320–321
web of trust, 262
what models (network troubleshooting), 635–637
白盒运动。查看 分类网络
white box movement. See disaggregated networks
windowing, 65
fixed window flow control, 67–69
single packet windows (ping pong), 65–68
Wireless 802.11, 102
error management, 109
flow control, 109
marshaling, 109
multiplexing, 102
multiple paths within a single room, 103–104
spatial multiplexing, 103–104, 106–107
wireless networks, hidden nodes, 108–109
XML (Extensible Markup Language)